Abstract
Here you can find how the processing and plotting related to our B cell receptor clonotype analysis was performed for the paper entitled “Multivalent Antigen Display on Nanoparticles Elicits Diverse and Broad B cell Responses”This Rmarkdown provides the analysis for the processed
data in the paper entitled “Multivalent Antigen Display on Nanoparticles
Diversifies B Cell Responses”. It includes the B cell receptors
analysis, plots, and generation of intermediate files for plotting using
other programs. The processed files used here are AIRR-compliant
datasets which can be downloaded from Zenodo (DOI: 00000000), they are
either Human Respiratory Syncytial Virus-specific (RSV-specific) or the
full bulk repertoire from the vaccinated non-human primates (Macaca
mulatta).
Loaded libraries need to be present, the snakemake
command will try to install them in the conda environment
library(jsonlite)
library(tidyr)
library(treeio)
library(ggtree)
library(rstatix)
library(ggpubr)
library(Biostrings)
library(data.table)
library(vegan)
library(ggplot2)
library(scales)
library(ggprism)
library(dplyr)
library(kableExtra)
source("df_to_fasta.R")
set.seed(123)
Some data used here is stored as .rds format, but it can
also be found as .tsv at Zenodo (DOI: 000000000).
data_comp <- read.csv("../data/ELISA_comp/2023-03-06_normalized_plasma_compt.csv") %>%
mutate(group = case_when(group == "NP 100%" ~ "20-mer",
group == "NP 50%" ~ "10-mer",
group == "Sol" ~ "1-mer",
group == "PostF" ~ "PostF"))
clonotype_rsv <- readRDS("../data/clonotypes/individualized/rsv-specific_clonotypes.rds")
# replace column names for AIRR-compliant names and filter out "PV" timepoint which was not used for the analysis
clonotype_rsv <- clonotype_rsv %>%
filter(timepoint != "PV") %>%
mutate(cdr3_aa_length = nchar(CDR3_aa),
cdr3_aa = CDR3_aa,
v_call = V_gene,
j_call = J_gene,
d_call = D_gene)
# set color for fill and color aes layers
fill_col_values <- c("20-mer" = "#5F90B0", "10-mer" = "#92CDD5", "1-mer" = "#D896C1", "PostF" = "grey50")
color_values <- c("20-mer" = "#2E6997", "10-mer" = "#469698", "1-mer" = "#BE3C8C", "PostF" = "grey20")
# load repertoire sequencing reads info
rep_seq_ls <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = TRUE)
rep_seq_names <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = FALSE)
names(rep_seq_ls) <- sub("_st","",gsub("/","_",substr(rep_seq_names, 1,10)))
# load mAbs data
lor_mabs <- read.csv("../data/specificity/LOR_single-cells_characterized.csv")
mabs_lors <- read.csv("../data/single_cell/sc_summary_filtered.csv")
# load competition ELISA data
# edited the raw values to have the max value as the reference competition for ADI, MPE8, 101F, D25
data_comp_auc <- read.csv("../data/ELISA_comp/2023-03-10_LOR_norm-dAUC.csv", header = T, row.names = 1, encoding="UTF-8") %>%
mutate(specificity = factor(specificity, levels = c("PreF", "PreF/PostF", "PostF", "w.b.")),
epitope = factor(epitope, levels = c("Ø","V", "Ø/V", "II/V", "III", "IV", "I/IV", "II","I", "UK-Pre", "UK-DP", "UK-PostF", "foldon", "WB")))
# load light chains information clonotyped
clono_light_chains <- read.table("../data/clonotypes/light_chain_assigned_clonotypes.tsv", sep = "\t", header = TRUE)
The non-human primates plasma samples from this study were used for antibody competition ELISAs coated with pre-fusion or post-fusion proteins against previously characterized monoclonal antibodies. Here we show this data using different visualization methods.
Multidimensional scaling aims to conserve distances between datapoints and/or samples from a set of variables. Thus, closer a point is to each other, closer their competition profile in this case. It is a good way to summarize in a 2D-space a large number of variables. The MDS input is a dissimilarity matrix, for this plot the input the matrix is based on cosine distance. Although the euclidean distance is the most used, we have decided to use the cosine distance here because the magnitude of responses are less important than the competition profile itself for us. Some NHPs were really good responders, thus having higher titers for all the competitions, thus euclidean distance would drive them far away from all the other NHPs because of their high antibody titer. Since cosine distance takes into account the distance as an angle rather than the value itself, it does not take into account the weight or the magnitude of the antibody titers.
cosine_distance <- function(x) {
#For cosine similarity matrix
Matrix <- x %>%
select(-c(timepoints, ID, group)) %>%
as.matrix()
sim <- Matrix / sqrt(rowSums(Matrix * Matrix))
sim <- sim %*% t(sim)
#Convert to cosine dissimilarity matrix (distance matrix).
D_sim <- as.dist(1 - sim)
mds <- D_sim %>%
cmdscale(3) %>%
as_tibble()
colnames(mds) <- c("Dim.1", "Dim.2", "Dim.3")
mds <- mds %>%
mutate(group = x$group,
timepoints = x$timepoints,
ID = x$ID)
return(mds)
}
The code below plots 4 different MDS plots. They are divided between
2 weeks after Boost 1 (B1) and Boost 2 (B2), with or without animals
vaccinated with PostF immunogen. On the top of each plot, it is written
B1 or B2, and legends say if PostF animals are included or not. The
cosine distance was calculated separately using the custom
cosine_distance() provided above, for that reason, position
of points should not be compared directly between each plot but rather
with points within each plot.
# calculate cosine distance and generate MDS with and without PostF
mds_b1 <- data_comp %>% filter(timepoints == "B1") %>% cosine_distance()
mds_b2 <- data_comp %>% filter(timepoints == "B2") %>% cosine_distance()
mds_no_postf_b1 <- data_comp %>% filter(group != "PostF", timepoints == "B1") %>% cosine_distance()
mds_no_postf_b2 <- data_comp %>% filter(group != "PostF", timepoints == "B2") %>% cosine_distance()
ls_mds <- list(mds_b1, mds_b2, mds_no_postf_b1, mds_no_postf_b2)
for(f in seq_along(ls_mds)){
# Plot and color by groups
mds_plot <- ls_mds[[f]]
if(f == 2){
mds_plot$Dim.1 <- mds_plot$Dim.1 * -1 # flip axis in first dimension
}
plot <- mds_plot %>%
ggplot(aes(Dim.1, Dim.2, color = group, fill = group)) +
geom_point(size = 4, shape = 21) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
lims(x = c(-.5,.5), y = c(-.4, .4)) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5),
legend.position = c(.1,.2),
legend.background = element_blank(),
legend.box.background = element_rect(colour = "black")) +
facet_wrap(~timepoints)
print(plot)
ggsave(plot = plot, filename = paste0("../", result.dir, f,"_mds_cosine-distance.pdf"), device = "pdf", width = 4, height = 4)
}
data_comp_longer <- data_comp %>%
pivot_longer(cols = 2:9, names_to = "mAb", values_to = "ELISA_competition") %>%
mutate(epitopes = plyr::mapvalues(mAb,
from = c("D25.PreF", "MPE8.PreF", "ADI.PreF", "Pali.PreF", "X101F.PreF",
"ADI.PostF", "X101F.PostF", "Pali.PostF"),
to = c("Ø", "III", "I", "II", "IV",
"I", "IV", "II")),
epitopes = factor(epitopes, levels = c("Ø", "III", "IV", "II", "I")),
conformation = factor(case_when(grepl("PostF", mAb) ~ "PostF",
grepl("PreF", mAb) ~ "PreF"), levels = c("PreF", "PostF")),
timepoint_group = paste(timepoints, group, sep = "_"),
timepoint_group_epitope = paste(timepoints, group, epitopes, sep = "_"))
my_comparisons_sites <- list(c("B1_1-mer", "B1_20-mer"),
c("B1_1-mer", "B1_10-mer"),
c("B1_1-mer", "B1_PostF"),
c("B1_10-mer", "B1_20-mer"),
c("B1_10-mer", "B1_PostF"),
c("B1_20-mer", "B1_PostF"),
c("B2_1-mer", "B2_20-mer"),
c("B2_1-mer", "B2_10-mer"),
c("B2_1-mer", "B2_PostF"),
c("B2_10-mer", "B2_20-mer"),
c("B2_10-mer", "B2_PostF"),
c("B2_20-mer", "B2_PostF"))
my_comparisons_1 <- lapply(my_comparisons_sites, paste0, "_I")
my_comparisons_2 <- lapply(my_comparisons_sites, paste0, "_II")
my_comparisons_3 <- lapply(my_comparisons_sites, paste0, "_III")
my_comparisons_4 <- lapply(my_comparisons_sites, paste0, "_IV")
my_comparisons_5 <- lapply(my_comparisons_sites, paste0, "_Ø")
# Selecting sites for statistical comparison only between groups for each timepoint
# removed statistical comparisons for Boost 1 site I in PreF since most values were below the threshold of detection
my_comparisons_preF <- c(my_comparisons_1[7:12], my_comparisons_2, my_comparisons_3, my_comparisons_4, my_comparisons_5)
my_comparisons_postF <- c(my_comparisons_1, my_comparisons_2, my_comparisons_4)
stat.test <- data_comp_longer %>%
filter(conformation == "PreF") %>%
wilcox_test(formula = ELISA_competition ~ timepoint_group_epitope,
paired = FALSE,
p.adjust.method = "fdr",
comparisons = my_comparisons_preF)
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif |
|---|---|---|---|---|---|---|---|---|
| ELISA_competition | B2_1-mer_I | B2_20-mer_I | 4 | 5 | 14.0 | 0.413 | 0.812 | ns |
| ELISA_competition | B2_1-mer_I | B2_10-mer_I | 4 | 5 | 20.0 | 0.020 | 0.176 | ns |
| ELISA_competition | B2_1-mer_I | B2_PostF_I | 4 | 4 | 0.0 | 0.029 | 0.190 | ns |
| ELISA_competition | B2_10-mer_I | B2_20-mer_I | 5 | 5 | 6.0 | 0.205 | 0.527 | ns |
| ELISA_competition | B2_10-mer_I | B2_PostF_I | 5 | 4 | 0.0 | 0.020 | 0.176 | ns |
| ELISA_competition | B2_20-mer_I | B2_PostF_I | 5 | 4 | 2.0 | 0.064 | 0.343 | ns |
| ELISA_competition | B1_1-mer_II | B1_20-mer_II | 4 | 5 | 3.0 | 0.111 | 0.428 | ns |
| ELISA_competition | B1_1-mer_II | B1_10-mer_II | 4 | 5 | 8.0 | 0.730 | 0.939 | ns |
| ELISA_competition | B1_1-mer_II | B1_PostF_II | 4 | 4 | 12.0 | 0.343 | 0.772 | ns |
| ELISA_competition | B1_10-mer_II | B1_20-mer_II | 5 | 5 | 6.0 | 0.222 | 0.545 | ns |
| ELISA_competition | B1_10-mer_II | B1_PostF_II | 5 | 4 | 16.0 | 0.190 | 0.513 | ns |
| ELISA_competition | B1_20-mer_II | B1_PostF_II | 5 | 4 | 19.0 | 0.032 | 0.190 | ns |
| ELISA_competition | B2_1-mer_II | B2_20-mer_II | 4 | 5 | 11.0 | 0.905 | 0.958 | ns |
| ELISA_competition | B2_1-mer_II | B2_10-mer_II | 4 | 5 | 11.0 | 0.905 | 0.958 | ns |
| ELISA_competition | B2_1-mer_II | B2_PostF_II | 4 | 4 | 6.0 | 0.686 | 0.939 | ns |
| ELISA_competition | B2_10-mer_II | B2_20-mer_II | 5 | 5 | 9.0 | 0.548 | 0.883 | ns |
| ELISA_competition | B2_10-mer_II | B2_PostF_II | 5 | 4 | 6.0 | 0.413 | 0.812 | ns |
| ELISA_competition | B2_20-mer_II | B2_PostF_II | 5 | 4 | 9.0 | 0.905 | 0.958 | ns |
| ELISA_competition | B1_1-mer_III | B1_20-mer_III | 4 | 5 | 7.0 | 0.556 | 0.883 | ns |
| ELISA_competition | B1_1-mer_III | B1_10-mer_III | 4 | 5 | 8.0 | 0.730 | 0.939 | ns |
| ELISA_competition | B1_1-mer_III | B1_PostF_III | 4 | 4 | 13.5 | 0.124 | 0.446 | ns |
| ELISA_competition | B1_10-mer_III | B1_20-mer_III | 5 | 5 | 13.0 | 1.000 | 1.000 | ns |
| ELISA_competition | B1_10-mer_III | B1_PostF_III | 5 | 4 | 20.0 | 0.018 | 0.176 | ns |
| ELISA_competition | B1_20-mer_III | B1_PostF_III | 5 | 4 | 20.0 | 0.018 | 0.176 | ns |
| ELISA_competition | B2_1-mer_III | B2_20-mer_III | 4 | 5 | 4.0 | 0.190 | 0.513 | ns |
| ELISA_competition | B2_1-mer_III | B2_10-mer_III | 4 | 5 | 11.0 | 0.905 | 0.958 | ns |
| ELISA_competition | B2_1-mer_III | B2_PostF_III | 4 | 4 | 11.0 | 0.486 | 0.883 | ns |
| ELISA_competition | B2_10-mer_III | B2_20-mer_III | 5 | 5 | 8.0 | 0.421 | 0.812 | ns |
| ELISA_competition | B2_10-mer_III | B2_PostF_III | 5 | 4 | 11.0 | 0.905 | 0.958 | ns |
| ELISA_competition | B2_20-mer_III | B2_PostF_III | 5 | 4 | 16.0 | 0.190 | 0.513 | ns |
| ELISA_competition | B1_1-mer_IV | B1_20-mer_IV | 4 | 5 | 4.0 | 0.190 | 0.513 | ns |
| ELISA_competition | B1_1-mer_IV | B1_10-mer_IV | 4 | 5 | 6.0 | 0.413 | 0.812 | ns |
| ELISA_competition | B1_1-mer_IV | B1_PostF_IV | 4 | 4 | 8.0 | 1.000 | 1.000 | ns |
| ELISA_competition | B1_10-mer_IV | B1_20-mer_IV | 5 | 5 | 11.0 | 0.841 | 0.958 | ns |
| ELISA_competition | B1_10-mer_IV | B1_PostF_IV | 5 | 4 | 13.0 | 0.556 | 0.883 | ns |
| ELISA_competition | B1_20-mer_IV | B1_PostF_IV | 5 | 4 | 13.0 | 0.556 | 0.883 | ns |
| ELISA_competition | B2_1-mer_IV | B2_20-mer_IV | 4 | 5 | 8.0 | 0.730 | 0.939 | ns |
| ELISA_competition | B2_1-mer_IV | B2_10-mer_IV | 4 | 5 | 11.0 | 0.905 | 0.958 | ns |
| ELISA_competition | B2_1-mer_IV | B2_PostF_IV | 4 | 4 | 7.0 | 0.886 | 0.958 | ns |
| ELISA_competition | B2_10-mer_IV | B2_20-mer_IV | 5 | 5 | 4.0 | 0.095 | 0.428 | ns |
| ELISA_competition | B2_10-mer_IV | B2_PostF_IV | 5 | 4 | 5.0 | 0.286 | 0.671 | ns |
| ELISA_competition | B2_20-mer_IV | B2_PostF_IV | 5 | 4 | 12.0 | 0.730 | 0.939 | ns |
| ELISA_competition | B1_1-mer_Ø | B1_20-mer_Ø | 4 | 5 | 11.0 | 0.898 | 0.958 | ns |
| ELISA_competition | B1_1-mer_Ø | B1_10-mer_Ø | 4 | 5 | 12.0 | 0.701 | 0.939 | ns |
| ELISA_competition | B1_1-mer_Ø | B1_PostF_Ø | 4 | 4 | 12.0 | 0.186 | 0.513 | ns |
| ELISA_competition | B1_10-mer_Ø | B1_20-mer_Ø | 5 | 5 | 10.0 | 0.666 | 0.939 | ns |
| ELISA_competition | B1_10-mer_Ø | B1_PostF_Ø | 5 | 4 | 16.0 | 0.109 | 0.428 | ns |
| ELISA_competition | B1_20-mer_Ø | B1_PostF_Ø | 5 | 4 | 16.0 | 0.109 | 0.428 | ns |
| ELISA_competition | B2_1-mer_Ø | B2_20-mer_Ø | 4 | 5 | 13.0 | 0.556 | 0.883 | ns |
| ELISA_competition | B2_1-mer_Ø | B2_10-mer_Ø | 4 | 5 | 12.0 | 0.730 | 0.939 | ns |
| ELISA_competition | B2_1-mer_Ø | B2_PostF_Ø | 4 | 4 | 16.0 | 0.029 | 0.190 | ns |
| ELISA_competition | B2_10-mer_Ø | B2_20-mer_Ø | 5 | 5 | 13.0 | 1.000 | 1.000 | ns |
| ELISA_competition | B2_10-mer_Ø | B2_PostF_Ø | 5 | 4 | 20.0 | 0.020 | 0.176 | ns |
| ELISA_competition | B2_20-mer_Ø | B2_PostF_Ø | 5 | 4 | 20.0 | 0.020 | 0.176 | ns |
data_comp_longer %>%
filter(conformation == "PreF") %>%
ggplot(aes(x = timepoint_group,
y = ELISA_competition,
color = group,
fill = group)) +
geom_point(size = 2, shape = 21) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = .5) +
scale_y_log10() +
stat_summary(fun.y = mean,
geom = "crossbar",
color = "black") +
geom_hline(yintercept = 10, linetype = "dashed") +
labs(title = "PreF Antigenic Sites", y = "50% Competition Titer", x = "") +
theme(axis.ticks = element_line(size = .5),
legend.background = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 7)) +
facet_wrap(~epitopes, nrow = 1)
stat.test <- data_comp_longer %>%
filter(conformation == "PostF") %>%
wilcox_test(formula = ELISA_competition ~ timepoint_group_epitope,
paired = FALSE,
p.adjust.method = "fdr",
comparisons = my_comparisons_postF) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ELISA_competition | B1_1-mer_I | B1_20-mer_I | 4 | 5 | 0.0 | 0.020 | 0.088 | ns | 6533.097 | B1_1-mer…. | 1 | 7 |
| ELISA_competition | B1_1-mer_I | B1_10-mer_I | 4 | 5 | 0.0 | 0.020 | 0.088 | ns | 7250.111 | B1_1-mer…. | 1 | 4 |
| ELISA_competition | B1_1-mer_I | B1_PostF_I | 4 | 4 | 1.0 | 0.059 | 0.127 | ns | 7967.125 | B1_1-mer…. | 1 | 10 |
| ELISA_competition | B1_10-mer_I | B1_20-mer_I | 5 | 5 | 3.0 | 0.056 | 0.127 | ns | 8684.139 | B1_10-me…. | 4 | 7 |
| ELISA_competition | B1_10-mer_I | B1_PostF_I | 5 | 4 | 9.0 | 0.905 | 0.905 | ns | 9401.153 | B1_10-me…. | 4 | 10 |
| ELISA_competition | B1_20-mer_I | B1_PostF_I | 5 | 4 | 12.0 | 0.730 | 0.773 | ns | 10118.168 | B1_20-me…. | 7 | 10 |
| ELISA_competition | B2_1-mer_I | B2_20-mer_I | 4 | 5 | 0.0 | 0.016 | 0.088 | ns | 10835.182 | B2_1-mer…. | 13 | 19 |
| ELISA_competition | B2_1-mer_I | B2_10-mer_I | 4 | 5 | 5.0 | 0.286 | 0.381 | ns | 11552.196 | B2_1-mer…. | 13 | 16 |
| ELISA_competition | B2_1-mer_I | B2_PostF_I | 4 | 4 | 0.0 | 0.029 | 0.088 | ns | 12269.210 | B2_1-mer…. | 13 | 22 |
| ELISA_competition | B2_10-mer_I | B2_20-mer_I | 5 | 5 | 7.0 | 0.310 | 0.385 | ns | 12986.224 | B2_10-me…. | 16 | 19 |
| ELISA_competition | B2_10-mer_I | B2_PostF_I | 5 | 4 | 0.0 | 0.016 | 0.088 | ns | 13703.238 | B2_10-me…. | 16 | 22 |
| ELISA_competition | B2_20-mer_I | B2_PostF_I | 5 | 4 | 3.0 | 0.111 | 0.195 | ns | 14420.252 | B2_20-me…. | 19 | 22 |
| ELISA_competition | B1_1-mer_II | B1_20-mer_II | 4 | 5 | 3.5 | 0.125 | 0.205 | ns | 15137.266 | B1_1-mer…. | 2 | 8 |
| ELISA_competition | B1_1-mer_II | B1_10-mer_II | 4 | 5 | 8.5 | 0.771 | 0.793 | ns | 15854.280 | B1_1-mer…. | 2 | 5 |
| ELISA_competition | B1_1-mer_II | B1_PostF_II | 4 | 4 | 0.0 | 0.026 | 0.088 | ns | 16571.294 | B1_1-mer…. | 2 | 11 |
| ELISA_competition | B1_10-mer_II | B1_20-mer_II | 5 | 5 | 6.5 | 0.236 | 0.327 | ns | 17288.309 | B1_10-me…. | 5 | 8 |
| ELISA_competition | B1_10-mer_II | B1_PostF_II | 5 | 4 | 1.0 | 0.034 | 0.088 | ns | 18005.323 | B1_10-me…. | 5 | 11 |
| ELISA_competition | B1_20-mer_II | B1_PostF_II | 5 | 4 | 2.0 | 0.064 | 0.127 | ns | 18722.337 | B1_20-me…. | 8 | 11 |
| ELISA_competition | B2_1-mer_II | B2_20-mer_II | 4 | 5 | 1.0 | 0.032 | 0.088 | ns | 19439.351 | B2_1-mer…. | 14 | 20 |
| ELISA_competition | B2_1-mer_II | B2_10-mer_II | 4 | 5 | 4.0 | 0.190 | 0.297 | ns | 20156.365 | B2_1-mer…. | 14 | 17 |
| ELISA_competition | B2_1-mer_II | B2_PostF_II | 4 | 4 | 0.0 | 0.029 | 0.088 | ns | 20873.379 | B2_1-mer…. | 14 | 23 |
| ELISA_competition | B2_10-mer_II | B2_20-mer_II | 5 | 5 | 8.0 | 0.421 | 0.474 | ns | 21590.393 | B2_10-me…. | 17 | 20 |
| ELISA_competition | B2_10-mer_II | B2_PostF_II | 5 | 4 | 0.0 | 0.016 | 0.088 | ns | 22307.407 | B2_10-me…. | 17 | 23 |
| ELISA_competition | B2_20-mer_II | B2_PostF_II | 5 | 4 | 2.0 | 0.064 | 0.127 | ns | 23024.421 | B2_20-me…. | 20 | 23 |
| ELISA_competition | B1_1-mer_IV | B1_20-mer_IV | 4 | 5 | 1.0 | 0.032 | 0.088 | ns | 23741.435 | B1_1-mer…. | 3 | 9 |
| ELISA_competition | B1_1-mer_IV | B1_10-mer_IV | 4 | 5 | 4.5 | 0.219 | 0.320 | ns | 24458.449 | B1_1-mer…. | 3 | 6 |
| ELISA_competition | B1_1-mer_IV | B1_PostF_IV | 4 | 4 | 2.0 | 0.114 | 0.195 | ns | 25175.464 | B1_1-mer…. | 3 | 12 |
| ELISA_competition | B1_10-mer_IV | B1_20-mer_IV | 5 | 5 | 6.0 | 0.222 | 0.320 | ns | 25892.478 | B1_10-me…. | 6 | 9 |
| ELISA_competition | B1_10-mer_IV | B1_PostF_IV | 5 | 4 | 6.0 | 0.413 | 0.474 | ns | 26609.492 | B1_10-me…. | 6 | 12 |
| ELISA_competition | B1_20-mer_IV | B1_PostF_IV | 5 | 4 | 13.0 | 0.556 | 0.607 | ns | 27326.506 | B1_20-me…. | 9 | 12 |
| ELISA_competition | B2_1-mer_IV | B2_20-mer_IV | 4 | 5 | 0.0 | 0.016 | 0.088 | ns | 28043.520 | B2_1-mer…. | 15 | 21 |
| ELISA_competition | B2_1-mer_IV | B2_10-mer_IV | 4 | 5 | 6.0 | 0.413 | 0.474 | ns | 28760.534 | B2_1-mer…. | 15 | 18 |
| ELISA_competition | B2_1-mer_IV | B2_PostF_IV | 4 | 4 | 0.0 | 0.029 | 0.088 | ns | 29477.548 | B2_1-mer…. | 15 | 24 |
| ELISA_competition | B2_10-mer_IV | B2_20-mer_IV | 5 | 5 | 7.0 | 0.310 | 0.385 | ns | 30194.562 | B2_10-me…. | 18 | 21 |
| ELISA_competition | B2_10-mer_IV | B2_PostF_IV | 5 | 4 | 0.0 | 0.016 | 0.088 | ns | 30911.576 | B2_10-me…. | 18 | 24 |
| ELISA_competition | B2_20-mer_IV | B2_PostF_IV | 5 | 4 | 3.0 | 0.111 | 0.195 | ns | 31628.590 | B2_20-me…. | 21 | 24 |
data_comp_longer %>%
filter(conformation == "PostF") %>%
ggplot(aes(x = timepoint_group,
y = ELISA_competition,
color = group,
fill = group)) +
geom_point(size = 2, shape = 21) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = .5) +
scale_y_log10() +
stat_summary(fun.y = mean,
geom = "crossbar",
color = "black") +
geom_hline(yintercept = 10, linetype = "dashed") +
labs(title = "PostF Antigenic Sites", y = "50% Competition Titer", x = "") +
theme(axis.ticks = element_line(size = .5),
legend.background = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size = 7)) +
facet_wrap(~epitopes, nrow = 1)
Save table containing read information from the high-throughput sequencing. It includes number of raw reads per animal and number of processed reads with high quality assignment of HV and HJ genes using IgDiscover software as an IgBlast wrapper.
rep_seq_stat <- lapply(rep_seq_ls, function(x) fromJSON(txt = x ))
rep_seq_stat <- lapply(rep_seq_stat, as.data.frame)
rep_seq_stat <- data.table::rbindlist(rep_seq_stat, idcol = "ID", fill = TRUE)
rep_seq_stat <- rep_seq_stat %>%
select(c(ID,read_preprocessing.raw_reads, assignment_filtering.total, assignment_filtering.has_vj_assignment, 17:21))
write.table(rep_seq_stat, "../data/processed_data/summary_sequencing_table.tsv", row.names = FALSE, sep = "\t", quote = FALSE )
After merging the antigen-specific sequences and bulk sequencing, we
run the the clonotype module from IgDiscover to assign
clonotype grouping to each sequence. We have done that for both the
individualized germline and the KIMDB database. Here we are reading
those datasets
ls <- list.files("../data/clonotypes", recursive = T, full.names = T)
ls <- ls[grepl("rsv", ls) & grepl("individualized|nestor-rm", ls) ]
names(ls) <- basename(dirname(ls))
rds_merge <- lapply(ls, readRDS)
rds_merge <- rbindlist(rds_merge, idcol = TRUE, fill = TRUE)
rds_merge <- rds_merge %>%
select(.id, specificity_group, sc_clone_grp, grp, new_name, ID_timepoint, V_SHM, V_errors, CDR3_aa, cdr3_aa, V_gene, J_gene, v_call, j_call) %>%
mutate(
Timepoint = factor(gsub(".*_", "", ID_timepoint), levels = c("PV", "B1", "B2", "Single-cell")),
ID = gsub("_.*", "", ID_timepoint),
database = plyr::mapvalues(.id,
from = c("nestor-rm", "individualized"),
to = c("KIMDB", "Individualized")
),
database = factor(database, levels = c("KIMDB", "Individualized")),
Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
),
cdr3_aa_length = ifelse(is.na(CDR3_aa), nchar(cdr3_aa), nchar(CDR3_aa)),
v_call = ifelse(is.na(v_call), V_gene, v_call),
j_call = ifelse(is.na(j_call), J_gene, j_call)
) %>%
group_by(.id, ID_timepoint) %>%
distinct(new_name, .keep_all = TRUE) %>%
ungroup() %>%
filter(database %in% c("KIMDB", "Individualized"))
Plotting sequencing depth for each timepoint per animal. This indicates that the sequencing depth for Boost 1 was lower, thus for that reason all the analysis were normalized by sequencing depth to take that into account.
# samples from BR2-B2 from E17 were merged with TR1-B2
# this was done due to low sequencing depth for that animal and large expansion of uncharacterized clones
full_rep_indiv <- read.table("../data/processed_data/summary_sequencing_table.tsv",
header = TRUE) %>%
separate(ID, sep = "_", into = c("sample", "ID")) %>%
filter(sample != "TR2-B2") %>%
mutate(sample = ifelse(sample == "BR2-B2", "TR1-B2", sample),
timepoint = case_when(sample == "igm" ~ "PV",
grepl("B1", sample) ~ "B1",
grepl("B2", sample) ~ "B2"),
timepoint = factor(timepoint, levels = c("PV", "B1", "B2"))) %>%
group_by(timepoint, ID) %>%
summarise(across(where(is.numeric), sum)) %>%
ungroup()
# since here we wanted to compare all groups, we used Dunn's test
stat.test <- full_rep_indiv %>%
dunn_test(formula = assignment_filtering.has_cdr3 ~ timepoint,
p.adjust.method = "fdr") %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| assignment_filtering.has_cdr3 | PV | B1 | 9 | 9 | -3.6525703 | 0.0002596 | 0.0007789 | *** | 1393265 | PV, B1 | 1 | 2 |
| assignment_filtering.has_cdr3 | PV | B2 | 9 | 9 | -0.2672612 | 0.7892680 | 0.7892680 | ns | 1491771 | PV, B2 | 1 | 3 |
| assignment_filtering.has_cdr3 | B1 | B2 | 9 | 9 | 3.3853091 | 0.0007110 | 0.0010665 | ** | 1590276 | B1, B2 | 2 | 3 |
full_rep_indiv %>%
ggdotplot(y = "assignment_filtering.has_cdr3", x = "timepoint",
group = "timepoint",
fill = "ID",
size = 1,) +
geom_boxplot(alpha = .2, outlier.shape = NA) +
ggpubr::stat_compare_means(method = "kruskal") +
labs(x = "Timepoint", y = "# good quality aligned sequences") +
scale_fill_viridis_d() +
stat_pvalue_manual(stat.test %>% mutate(p.adj = round(p.adj, 4)), label = "p.adj") +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = 1, axis_text_angle = 45, base_rect_size = 1.5)
ggsave(paste0("../", result.dir, "repertoire_depth.pdf"), width = 5, height = 3)
rds_summary <- rds_merge %>%
group_by(database, ID_timepoint, grp) %>%
summarise(
ID = first(ID), Timepoint = first(Timepoint), Group = first(Group),
clonal_size = n(),
database,
sc_clone_grp = first(sc_clone_grp),
V_errors = mean(V_errors),
specificity_group = first(specificity_group),
cdr3_aa_length = mean(cdr3_aa_length),
v_call = first(v_call),
j_call = first(j_call)
) %>%
ungroup() %>%
group_by(database, ID_timepoint) %>%
mutate(clonal_size_rank = dense_rank(dplyr::desc(clonal_size))) %>%
ungroup() %>%
distinct()
rds_summary_noPV <- rds_summary %>%
filter(Timepoint != "PV", Timepoint != "Single-cell")
rds_summary_save <- rds_summary %>%
filter(Timepoint %in% c("Single-cell", "B1","B2"),
database == "Individualized") %>%
group_by(ID_timepoint) %>%
summarise(mean_clonal_size = mean(clonal_size),
median_clonal_size = median(clonal_size),
geom_clonal_size = exp(mean(log(clonal_size))),
unique_clones = sum(clonal_size == 1),
total_clones_detected = n(),
percentage_unique_clones = round(unique_clones/total_clones_detected*100,digits = 2))
rds_summary_save %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| ID_timepoint | mean_clonal_size | median_clonal_size | geom_clonal_size | unique_clones | total_clones_detected | percentage_unique_clones |
|---|---|---|---|---|---|---|
| E11_B1 | 73.765766 | 24.0 | 23.618580 | 7 | 111 | 6.31 |
| E11_B2 | 65.842324 | 18.0 | 16.139896 | 22 | 241 | 9.13 |
| E11_Single-cell | 1.596491 | 1.0 | 1.310902 | 169 | 228 | 74.12 |
| E12_B1 | 96.802632 | 8.5 | 11.635387 | 5 | 76 | 6.58 |
| E12_B2 | 62.696296 | 9.0 | 9.733452 | 18 | 135 | 13.33 |
| E12_Single-cell | 1.556000 | 1.0 | 1.304827 | 185 | 250 | 74.00 |
| E14_B1 | 79.458333 | 9.0 | 9.683393 | 11 | 96 | 11.46 |
| E14_B2 | 111.530612 | 10.5 | 12.335661 | 29 | 196 | 14.80 |
| E14_Single-cell | 2.041096 | 1.0 | 1.527244 | 139 | 219 | 63.47 |
| E16_B1 | 73.535519 | 14.0 | 15.223779 | 14 | 183 | 7.65 |
| E16_B2 | 118.491228 | 31.5 | 29.123260 | 17 | 228 | 7.46 |
| E16_Single-cell | 1.996094 | 1.0 | 1.489967 | 169 | 256 | 66.02 |
| E17_B1 | 38.460526 | 8.0 | 9.753009 | 8 | 76 | 10.53 |
| E17_B2 | 55.614035 | 6.0 | 7.902730 | 31 | 171 | 18.13 |
| E17_Single-cell | 1.665158 | 1.0 | 1.381309 | 151 | 221 | 68.33 |
| E18_B1 | 41.323529 | 11.0 | 11.772174 | 9 | 102 | 8.82 |
| E18_B2 | 72.662921 | 21.0 | 15.643111 | 23 | 178 | 12.92 |
| E18_Single-cell | 1.641026 | 1.0 | 1.368241 | 135 | 195 | 69.23 |
| E21_B1 | 35.201835 | 5.0 | 6.887328 | 21 | 109 | 19.27 |
| E21_B2 | 45.202312 | 14.0 | 13.455144 | 16 | 173 | 9.25 |
| E21_Single-cell | 1.674208 | 1.0 | 1.385378 | 151 | 221 | 68.33 |
| E23_B1 | 59.675497 | 9.0 | 11.417886 | 19 | 151 | 12.58 |
| E23_B2 | 63.044025 | 13.0 | 14.180265 | 18 | 159 | 11.32 |
| E23_Single-cell | 1.916279 | 1.0 | 1.484490 | 136 | 215 | 63.26 |
| E24_B1 | 55.469231 | 10.0 | 11.115858 | 18 | 130 | 13.85 |
| E24_B2 | 62.310735 | 9.0 | 10.429075 | 29 | 177 | 16.38 |
| E24_Single-cell | 1.829897 | 1.0 | 1.400170 | 137 | 194 | 70.62 |
write.csv(rds_summary_save,
paste0("../", result.dir, "clone_size_mean_median.csv"),row.names = FALSE)
rds_indiv <- rds_summary %>%
filter(
database == "Individualized",
Timepoint != "Single-cell",
Timepoint != "PV"
) %>%
mutate(
LOR = ifelse(grepl(pattern = paste0(lor_mabs$well_ID, collapse = "|"), x = sc_clone_grp), "cloned", "not_cloned"),
LOR = factor(LOR, levels = c("cloned", "not_cloned"), ordered = TRUE)
)
rds_indiv$Timepoint <- droplevels(rds_indiv$Timepoint)
all <- rds_indiv %>%
tidyr::expand(ID, Timepoint, grp) %>%
filter(Timepoint != "Single-cell")
rds_indiv <- rds_indiv %>%
right_join(all) %>%
mutate(
ID_timepoint = paste(ID, Timepoint, sep = "_"),
database = "individualized",
clonal_size = ifelse(is.na(clonal_size), 0, clonal_size)
) %>%
arrange(grp, ID_timepoint) %>%
tidyr::fill(LOR, Group, sc_clone_grp)
rds_indiv %>%
group_by(Group, Timepoint, specificity_group) %>%
summarise(clonal_size = sum(clonal_size)) %>%
tidyr::drop_na() %>%
filter(specificity_group != "PostF") %>%
ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
geom_area(color = "black") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 80000)) +
scale_x_discrete(expand = c(0.02, 0.02)) +
scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
labs(y = "Clonal size") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~Group)
ggsave(paste0("../", result.dir, "rsv_specific_clonal_size_area_plot_spec.pdf"), width = 8, height = 2)
rds_indiv %>%
group_by(Group, ID, Timepoint, specificity_group) %>%
summarise(clonal_size = sum(clonal_size)) %>%
tidyr::drop_na() %>%
filter(specificity_group != "PostF") %>%
ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
geom_area(color = "black") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 30000)) +
scale_x_discrete(expand = c(0.02, 0.02)) +
scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
labs(y = "Clonal size") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~Group + ID, ncol = 5)
df_all <- data.frame()
for (timepoints in unique(rds_summary_noPV$Timepoint)) {
for (databases in unique(rds_summary_noPV$database)) {
rds_timepoint <- rds_summary_noPV %>%
filter(Timepoint == timepoints, database == databases) %>%
arrange(desc(clonal_size)) %>%
pull(clonal_size)
rds_indiv_corr <- rds_summary_noPV %>%
filter(
database == "Individualized",
Timepoint == timepoints
) %>%
arrange(desc(clonal_size)) %>%
pull(clonal_size)
length(rds_indiv_corr) <- length(rds_timepoint)
df <- data.frame(
size_indiv = rds_indiv_corr,
clonal_size = rds_timepoint,
Timepoint = timepoints,
database = databases
)
df[is.na(df)] <- 0
df_all <- rbind(df, df_all)
}
}
df_all %>%
filter(database != "Individualized") %>%
mutate(database = factor(database, levels = c("KIMDB", "Individualized"))) %>%
ggplot(aes(x = size_indiv, y = clonal_size)) +
geom_point(size = 1) +
geom_abline(intercept = 0, slope = 1, color = "black", linetype = "dashed") +
scale_x_continuous(
trans = pseudo_log_trans(base = 10),
breaks = c(1, 10, 100, 1000, 10000),
labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
) +
scale_y_continuous(
trans = pseudo_log_trans(base = 10),
breaks = c(1, 10, 100, 1000, 10000),
labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
) +
labs(y = "Clonal size\n (KIMDB)", x = "Clonal size\n(Individualized database)") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~ Timepoint)
ggsave(paste0("../", result.dir, "individualized_db_comparison.pdf"), width = 12, height = 6)
for(i in c("Individualized", "KIMDB")){
rds_summary_noPV <- rds_summary %>%
# check if we want to filter clonotypes by size or not
filter(database == i) %>%
rbind(., within(., specificity_group <- "Total")) %>%
mutate(v_call = sub("\\*.*","", v_call),
j_call = sub("\\*.*|-.*","", j_call),
clonal_size_log = log10(clonal_size),
Group = factor(Group, levels = c("20-mer", "1-mer")),
specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP", "PostF"))) %>%
group_by(Group) %>%
mutate(clonal_size_perc = (clonal_size/sum(clonal_size)) * 100) %>%
ungroup() %>%
group_by(Group, specificity_group, v_call, j_call) %>%
summarise(clonal_size_perc = sum(clonal_size_perc)) %>%
filter(specificity_group != "PostF") %>% droplevels()
plot_vj <- rds_summary_noPV %>%
ggplot(aes(x= v_call, y = j_call, fill = clonal_size_perc)) +
geom_tile(color = "black") +
scale_fill_viridis_c(option = "viridis", direction = 1) +
scale_y_discrete(position = "right") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
coord_equal() +
theme(axis.text.x = element_text(angle = 90, size = 5, hjust = 1, vjust = 0.5, face = "bold", colour = "black"),
axis.text.y = element_text(face = "bold", colour = "black", size = 5),
strip.text = element_text(face = "bold", size = 7),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "top",
axis.ticks = element_line(size = .5),
legend.title = element_text()) +
labs(fill = "Sequence count\n(% RSV repertoire\n per group)", x = "IGHV alleles", y = "IGHJ alleles", title = paste0(i, " Database")) +
facet_wrap(~ specificity_group + Group, ncol = 1, strip.position = "left")
print(plot_vj)
ggsave(plot = plot_vj, filename = paste0("../",result.dir,i,"_IGHV-IGHJ_pairing-rsv-specific-sequences_perc.pdf"), width = 9, height = 7)
}
# groups to perform statistical comparisons
my_comparisons <- list(c("Total_B1_20-mer", "Total_B1_1-mer"),
c("Total_B2_20-mer", "Total_B2_1-mer"),
c("PreF_B1_20-mer", "PreF_B1_1-mer"),
c("PreF_B2_20-mer", "PreF_B2_1-mer"),
c("DP_B1_20-mer", "DP_B1_1-mer"),
c("DP_B2_20-mer", "DP_B2_1-mer"))
rds_summary_noPV <- rds_summary %>%
# If want to remove clonal sizer < 1, do it here.
filter(database == "Individualized", Timepoint != "Single-cell", Timepoint != "PV") %>%
rbind(., within(., specificity_group <- "Total")) %>%
filter(specificity_group != "PostF") %>%
mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
group_by(ID_timepoint, specificity_group, Group_specificity, Group) %>%
mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
distinct(v_j_calls) %>%
summarise(unique_v_j = n()) %>%
ungroup()
rds_summary_noPV %>% write.csv(paste0("../", result.dir,"unique_HV_HJ_pairing-data.csv"), row.names = F)
stat.test <- rds_summary_noPV %>%
wilcox_test(formula = unique_v_j ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| unique_v_j | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 14 | 0.413 | 0.496 | ns | 137.040 | Total_B1…. | 1 | 2 |
| unique_v_j | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 11 | 0.901 | 0.901 | ns | 150.288 | Total_B2…. | 3 | 4 |
| unique_v_j | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 4 | 0.176 | 0.264 | ns | 163.536 | PreF_B1_…. | 5 | 6 |
| unique_v_j | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 0 | 0.016 | 0.048 |
|
176.784 | PreF_B2_…. | 7 | 8 |
| unique_v_j | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 20 | 0.016 | 0.048 |
|
190.032 | DP_B1_20…. | 9 | 10 |
| unique_v_j | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 18 | 0.064 | 0.127 | ns | 203.280 | DP_B2_20…. | 11 | 12 |
rds_summary_noPV %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "unique_v_j",
fill = "Group",
color = "Group",
group = "Group_specificity",
legend = "none", size = 1) +
geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 200)) +
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
labs(y = "HV and HJ unique pairs\n(Count)", x = "") +
stat_summary(fun.y = mean,
geom = "crossbar") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 150) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "v-j-pair_count.pdf"), width = 5, height = 3)
rds_summary_noPV <- rds_summary %>%
# decide if clonal size greater than 1 should be included
filter(database == "Individualized", Timepoint != "Single-cell") %>%
rbind(., within(., specificity_group <- "Total")) %>%
filter(specificity_group != "PostF") %>%
mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
group_by(Group) %>%
mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
distinct(v_j_calls)
list_v_j <- list("20-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "20-mer"],
"1-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "1-mer"])
ggVennDiagram::ggVennDiagram(list_v_j, label_size = 7,
label_alpha = 0,
set_color = "black",
set_size = 9) +
scale_fill_gradient(low = "#F4FAFE", high = "#4981BF") +
scale_color_manual(values = c("#5F90B0", "#D896C1")) +
theme(legend.position = "none")
ggsave(paste0("../", result.dir,"shared-unique-v-j_pairing.pdf"), height = 5, width = 5)
Intermediate files are commented out, so it will not save all of
them. One could uncomment them if all the intermediate files are needed.
Intermediate files were used to run Recon (v2.5) (Kaplinsky
& Arnaout, Nat Commmun, 2016) according to default
parameters, more about running Recon is described in the next section Recon estimates for
RSV-specific diversity.
# add column for loop of ID, timepoint and specificity
clonotype_rsv <- clonotype_rsv %>%
mutate(ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group))
# create empty lists for loop
filtered_animal_rsv <- list()
clonotype_summary_rsv <- list()
# loop to create summary and full clonotype files for saving and/or following analysis
for (animals in unique(clonotype_rsv$ID_timepoint_spec)) {
# save full clonotype files
filtered_animal_rsv[[animals]] <- clonotype_rsv %>%
filter(ID_timepoint_spec == animals) %>%
as.data.frame()
# write.csv(filtered_animal_rsv[[animals]], paste0("../", result.dir, animals, "_clonotype_full.csv"), row.names = FALSE)
# save summary files
clonotype_summary_rsv[[animals]] <- filtered_animal_rsv[[animals]] %>%
group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
arrange(desc(clonal_size)) %>%
ungroup()
#write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
}
# doing the same thing but for total, that means not account for specificities (PreF, DP,PostF)
for (animals in unique(clonotype_rsv$ID_timepoint)) {
# save summary files
clonotype_summary_rsv[[paste0(animals, "_total")]] <- clonotype_rsv %>%
filter(ID_timepoint == animals) %>%
as.data.frame() %>%
mutate(
specificity_group = "Total",
ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group)
) %>%
group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
arrange(desc(clonal_size)) %>%
ungroup()
# write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
}
# Save clonotype summary per animal, taking into account ID, timepoint and clonal group
filtered_animal_rsv_summary <- list()
for (animals in unique(clonotype_rsv$ID)) {
# save summary files
clonotype_rsv %>%
filter(ID == animals) %>%
as.data.frame() %>%
group_by(grp, timepoint, ID) %>%
summarise(clonal_size = n(), single_cells = first(sc_clone_grp), v_call = first(v_call), j_call = first(j_call), cdr3_aa = first(cdr3_aa)) %>%
arrange(desc(clonal_size)) %>%
ungroup()
# write.csv(paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
}
to_recon_rsv <- data.table::rbindlist(clonotype_summary_rsv) %>%
select(clonal_size, ID_timepoint_spec) %>%
group_by(clonal_size, ID_timepoint_spec) %>%
summarise(size = n()) %>%
ungroup()
for (i in unique(to_recon_rsv$ID_timepoint_spec)) {
to_recon_table <- to_recon_rsv %>%
filter(ID_timepoint_spec == i) %>%
select(clonal_size, size)
# fwrite(to_recon_table, file = paste0("../", result.dir, i, "_rsv_file_to_recon.txt"), append = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
}
Species richness (Chao1) was calculated using the vegan
r package. The samples were first subsampled 100 times for each
animal/timepoint and then the species richness was estimated. The mean
of those 100x replicates were used for plotting.
Here is the processing and estimation of species richness:
rsv_chao <- lapply(clonotype_summary_rsv, function(x) select(x, "clonal_size"))
# subsample and replicate the subsampling 100 times for higher accuracy
chaox100 <- function(x, value_to_subsample) {
replicate(100, {
subsample <- vegan::rrarefy(x, value_to_subsample)
chao <- vegan::estimateR(subsample)
return(chao)
})
}
diff_spec_timepoints <- unique(substring(names(clonotype_summary_rsv), 5))
# subsample based on lowest total clonotype size per group
chao_list_results <- list()
for (spec_timepoint in diff_spec_timepoints) {
print(spec_timepoint)
rsv_filtered <- rsv_chao[grepl(spec_timepoint, names(rsv_chao))]
min_to_subsample <- min(unlist(lapply(rsv_filtered, colSums)))
chao_list_results[[spec_timepoint]] <- lapply(rsv_filtered, chaox100, min_to_subsample)
}
## [1] "Single-cell_DP"
## [1] "B1_DP"
## [1] "B2_DP"
## [1] "Single-cell_PreF"
## [1] "B1_PreF"
## [1] "B2_PreF"
## [1] "Single-cell_PostF"
## [1] "B1_PostF"
## [1] "B2_PostF"
## [1] "Single-cell_total"
## [1] "B1_total"
## [1] "B2_total"
change_names <- function(x) {
names(x) <- gsub("_.*", "", names(x))
x
}
# adjusted dataset for plotting
{
chao_results_df <- purrr::map(chao_list_results, ~ change_names(.x))
chao_results_df <- rbindlist(chao_results_df, use.names = TRUE, idcol = TRUE, fill = TRUE)
chao_results_df$algorithm <- rep(c("Obs", "Chao1", "Chao1_se", "ACE", "ACE_se"), nrow(chao_results_df) / 5)
# save intermediate file
chao_results_df %>%
filter(algorithm %in% c("Chao1", "ACE")) %>%
dplyr::rename(Timepoint_specificity = .id) %>%
write.csv(paste0("../", result.dir, "rsv_repertoire_diversity.csv"), row.names = FALSE)
# diversity mean of x100 replicates per animal
chao_results_df %>%
filter(algorithm %in% c("Chao1", "ACE")) %>%
dplyr::rename(Timepoint_specificity = .id) %>%
group_by(algorithm, Timepoint_specificity) %>%
summarise_all(.funs = mean) %>%
write.csv(paste0("../", result.dir, "rsv_repertoire_diversity_mean.csv"), row.names = FALSE)
chao_results_df <- tidyr::pivot_longer(chao_results_df, cols = 2:(length(chao_results_df) - 1), names_to = c("ID")) %>%
tidyr::separate(.id, c("Timepoint", "Specificity"), sep = "_") %>%
mutate(Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
))
}
Here is the plotting and comparison between vaccinated group:
chao_results_df$Specificity[chao_results_df$Specificity == "total"] <- "Total"
chao_results_df <- chao_results_df %>%
filter(algorithm != "Chao1_se" & algorithm != "ACE_se") %>%
filter(algorithm == "Chao1", Timepoint != "PV", Timepoint != "Single-cell",) %>%
mutate(Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
),
Group_specificity = paste(Specificity, Timepoint, Group, sep = "_"),
Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
group_by(ID, Group, Timepoint, Specificity, algorithm, Group_specificity) %>%
summarise(value = mean(value)) %>%
tidyr::drop_na() %>%
ungroup()
write.csv(chao_results_df, paste0("../", result.dir, "chao1_results_plot_values.csv"), row.names = FALSE)
stat.test <- chao_results_df %>%
wilcox_test(formula = value ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| value | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 16 | 0.190 | 0.285 | ns | 272.3352 | Total_B1…. | 1 | 2 |
| value | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 15 | 0.286 | 0.343 | ns | 301.8278 | Total_B2…. | 3 | 4 |
| value | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 7 | 0.556 | 0.556 | ns | 331.3205 | PreF_B1_…. | 5 | 6 |
| value | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 2 | 0.064 | 0.127 | ns | 360.8131 | PreF_B2_…. | 7 | 8 |
| value | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 20 | 0.016 | 0.048 |
|
390.3058 | DP_B1_20…. | 9 | 10 |
| value | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 20 | 0.016 | 0.048 |
|
419.7984 | DP_B2_20…. | 11 | 12 |
chao_results_df %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "value",
fill = "Group",
color = "Group",
group = "Group_specificity",
legend = "none", size = 1) +
geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
labs(y = "Species richness\n(Chao1)", x = "") +
stat_summary(fun.y = mean,
geom = "crossbar") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "chao1_species_richness.pdf"), width = 5, height = 3)
According to Recon default settings and tutorial (check
Recon manual), the count files generated previously were used to create
the fitfiles.txt that were used as an input for generating
the diversity_table.txt for all the samples.
To generate the the fitfile for each count file, the bash script used in a for loop was:
#!/bin/sh
set -euo pipefail
FILE=$1
python recon_v2.5.py -R -t 30 -c -o ${FILE}_fitfile.txt $FILE
python recon_v2.5.py -x --x_max 30 -o ${FILE}_plotfile.txt -b error_bar_parameters.txt ${FILE}_fitfile.txt
Then, each fit file was used for generating the
diversity_table.txt with the command:
python recon_v2.5.py -v -D -b error_bar_parameters.txt -o output_diversity_table.txt *rsv_file_to_recon.txt_fitfile.txt
The results from diversity_table.txt for all the samples
were used as input for plotting.
recon_res <- read.table("../data/diversity_index/recon/rsv_output_diversity_table.txt", header = TRUE) %>%
filter(Timepoint != "PV", Timepoint != "Single-cell", Specificity != "PostF") %>%
mutate(Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
),
Group_specificity = paste(Specificity, Timepoint, Group, sep = "_")) %>%
mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
ungroup()
write.csv(recon_res, paste0("../", result.dir, "recon_results_plot_values.csv"), row.names = FALSE)
stat.test <- recon_res %>%
wilcox_test(formula = est_0.0D ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| est_0.0D | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 16 | 0.190 | 0.228 | ns | 275.7082 | Total_B1…. | 1 | 2 |
| est_0.0D | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 16 | 0.190 | 0.228 | ns | 304.8120 | Total_B2…. | 3 | 4 |
| est_0.0D | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 7 | 0.556 | 0.556 | ns | 333.9159 | PreF_B1_…. | 5 | 6 |
| est_0.0D | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 2 | 0.064 | 0.127 | ns | 363.0197 | PreF_B2_…. | 7 | 8 |
| est_0.0D | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 20 | 0.016 | 0.048 |
|
392.1236 | DP_B1_20…. | 9 | 10 |
| est_0.0D | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 20 | 0.016 | 0.048 |
|
421.2274 | DP_B2_20…. | 11 | 12 |
recon_res %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "est_0.0D",
fill = "Group",
color = "Group",
group = "Group_specificity",
legend = "none", size = 1) +
geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
labs(y = "Species richness\n(Recon)", x = "") +
stat_summary(fun.y = mean,
geom = "crossbar") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "recon_species_richness.pdf"), width = 5, height = 3)
Here the SHM is shown for all the sequences for every animal combined. Data is divided by 20-mer, 1-mer, and specificities. T
rds_indiv_total <- rds_indiv %>%
mutate(specificity_group = "Total")
rds_indiv <- rbind(rds_indiv, rds_indiv_total)
rds_indiv_plot <- rds_indiv %>%
filter(specificity_group != "PostF") %>%
mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_")) %>%
mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>% ungroup()
stat.test <- rds_indiv_plot %>%
wilcox_test(formula = V_errors ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| V_errors | Total_B1_20-mer | Total_B1_1-mer | 651 | 383 | 119282.5 | 0.246 | 0.492 | ns | 68.320 | Total_B1…. | 1 | 2 |
| V_errors | Total_B2_20-mer | Total_B2_1-mer | 976 | 682 | 349160.0 | 0.088 | 0.265 | ns | 73.504 | Total_B2…. | 3 | 4 |
| V_errors | PreF_B1_20-mer | PreF_B1_1-mer | 205 | 205 | 21279.5 | 0.824 | 0.841 | ns | 78.688 | PreF_B1_…. | 5 | 6 |
| V_errors | PreF_B2_20-mer | PreF_B2_1-mer | 346 | 389 | 69762.5 | 0.391 | 0.587 | ns | 83.872 | PreF_B2_…. | 7 | 8 |
| V_errors | DP_B1_20-mer | DP_B1_1-mer | 413 | 144 | 29401.0 | 0.841 | 0.841 | ns | 89.056 | DP_B1_20…. | 9 | 10 |
| V_errors | DP_B2_20-mer | DP_B2_1-mer | 583 | 250 | 79353.0 | 0.042 | 0.251 | ns | 94.240 | DP_B2_20…. | 11 | 12 |
rds_indiv_plot %>%
ggpubr::ggviolin(x = "Group_specificity", y = "V_errors", fill = "Group_specificity", group = "Group_specificity",
legend = "none") +
geom_boxplot(outlier.shape = NA, width = 0.15, color = "black", alpha = 0.2)+
geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 80)) +
scale_shape_manual(values=NA)+
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
labs(y = "# IGHV nucleotide mutations", x= "") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 70) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "rsv_specific_SHM_per_group.pdf"), width = 6, height = 3)
Here the SHM data is summarized per macaque, so each dot represents the average SHM of all the antigen-specific sequences for one animal.
rds_indiv_plot_summ <- rds_indiv_plot %>%
filter(Group_specificity != "PostF") %>%
group_by(ID, Group_specificity) %>%
summarise(avg_V_errors = mean(V_errors, na.rm = TRUE)) %>%
ungroup() %>%
tidyr::drop_na()
stat.test <- rds_indiv_plot_summ %>%
wilcox_test(formula = avg_V_errors ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| avg_V_errors | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 4 | 0.190 | 0.826 | ns | 20.4600 | Total_B1…. | 1 | 2 |
| avg_V_errors | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 12 | 0.730 | 0.876 | ns | 22.2204 | Total_B2…. | 3 | 4 |
| avg_V_errors | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 11 | 0.905 | 0.905 | ns | 23.9808 | PreF_B1_…. | 5 | 6 |
| avg_V_errors | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 12 | 0.730 | 0.876 | ns | 25.7412 | PreF_B2_…. | 7 | 8 |
| avg_V_errors | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 5 | 0.286 | 0.826 | ns | 27.5016 | DP_B1_20…. | 9 | 10 |
| avg_V_errors | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 14 | 0.413 | 0.826 | ns | 29.2620 | DP_B2_20…. | 11 | 12 |
rds_indiv_plot_summ %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "avg_V_errors", fill = "Group_specificity", group = "Group_specificity",
legend = "none") +
geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 25)) +
scale_shape_manual(values=NA)+
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
stat_summary(fun.y = mean,
geom = "crossbar") +
labs(y = "# IGHV nucleotide mutations", x= "") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 22) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
rds_indiv %>%
filter(specificity_group != "PostF") %>%
mutate(
Group_specificity = paste(Group, specificity_group, sep = "_"),
specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP")),
Timepoint = factor(Timepoint, levels = c("B1", "B2"))
) %>%
ggplot(aes(x = cdr3_aa_length, fill = Group)) +
geom_density(alpha = .7) +
scale_y_continuous(expand = c(0, 0)) +
scale_x_continuous(expand = c(0, 0)) +
scale_fill_manual(values = c("#5F90B0", "#D896C1")) +
labs(y = "Density", x = "HCDR3 aa length") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~ specificity_group + Timepoint, ncol = 3, dir = "v")
ggsave(paste0("../", result.dir, "rsv_specific_CDR3_length.pdf"), width = 6, height = 3)
kimdb <- Biostrings::readDNAStringSet("../data/databases/nestor_rm/V.fasta")
joined_dbs <- c(kimdb, individualized_dbs_sel)
uniq_joined_dbs <- unique(joined_dbs)
uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]
## DNAStringSet object of length 15:
## width seq names
## [1] 296 CAGGTCCAGCTGGTGCAATCCGG...CCGTGTATTACTGTGCAAGAGA E12.IGHV1-NL_1*01...
## [2] 293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E12.IGHV3-100*01
## [3] 293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
## [4] 293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
## [5] 298 GAAGTGCAGCTGGTGGAGTCTGG...TTGTATTACTGTAGTAGAGAGA E14.IGHV3-NL_15*0...
## ... ... ...
## [11] 296 GAGGTGCAGCTGGCGGAGTCTGG...CCGTGTATTACTGTGCGAGAGA E16.IGHV3-87*02
## [12] 299 CAGGTACAGCTGAAGGAGTCAGG...CCGTGTATTACTGTGCGAGACA E16.IGHV4-NL_23*0...
## [13] 296 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGAGA E18.IGHV4-NL_5*01...
## [14] 299 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGACA E18.IGHV4-NL_33*0...
## [15] 297 GAAGTGCAGCTGGTGGAGTCTGG...CTTGTATTACTGTGCAAAGATA E21.IGHV3-141*01
size_uniq_kimdb <- data.frame(
v_unique = table(substr(names(uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]), 1, 3)),
type = "Not validated"
) %>%
dplyr::rename(
v_count = v_unique.Freq,
ID = v_unique.Var1
)
size_indv_db$type <- "KIMDB"
kimdb_v_indiv <- rbind(size_indv_db, size_uniq_kimdb) %>%
add_row(ID = c("E11", "E17", "E23", "E24"), v_count = rep(0), type = rep("Not validated")) %>%
group_by(ID) %>%
arrange(ID, .by_group = TRUE) %>%
mutate(
diff = v_count - lag(v_count, default = last(v_count)),
diff = ifelse(diff < 0, v_count, diff)
)
kimdb_v_indiv %>%
write.csv(paste0("../",result.dir, "alleles_validated_KIMDB.csv"), row.names = FALSE)
kimdb_v_indiv %>%
mutate(type = factor(type, levels = c("Not validated", "KIMDB"))) %>%
ggplot(aes(x = ID, y = diff, fill = type)) +
geom_col(color = "black") +
scale_fill_viridis_d(direction = -1) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
scale_x_discrete(expand = c(0, 0)) +
labs(y = "V alelle counts", x = "Animal ID") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5))
ggsave(paste0("../", result.dir, "indiv_validated_KIMDB_alleles.pdf"), width = 8, height = 6.38)
clonal_tree_data <- clonotype_rsv %>%
ungroup() %>%
distinct(new_name, .keep_all = TRUE) %>%
group_by(specificity_group, sc_clone_grp, grp, ID_timepoint) %>%
add_tally(name = "clonal_size") %>%
ungroup()
mabs_of_interests <- c("LOR21" = "E16_05_A08", "LOR24" = "E16_05_H03")
# read data from heavy_chain
V_genes <- readDNAStringSet("../data/databases/individualized/combined/V.fasta")
J_genes <- readDNAStringSet("../data/databases/individualized/combined/J.fasta")
HC_all_filtered <- clonotype_rsv %>% filter(timepoint == "Single-cell")
# read data from light chain
LV_genes <- readDNAStringSet("../data/databases/cirelli_LC/V.fasta")
LJ_genes <- readDNAStringSet("../data/databases/cirelli_LC/J.fasta")
sc_seq_count <- list()
for(i in seq_along(mabs_of_interests)){
print(i)
mab <- mabs_of_interests[i]
if(any(stringr::str_count(names(mab), "LOR") <= 1)){
mab <- mabs_of_interests[i]
mab_name <- names(mab)
}
if(any(stringr::str_count(names(mab), "LOR") > 1)){
mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
mab_name <- names(mab)}
# create UCA heavy chain
UCA_V <- V_genes[HC_all_filtered$V_gene[HC_all_filtered$name %in% mab]]
UCA_J <- J_genes[HC_all_filtered$J_gene[HC_all_filtered$name %in% mab]]
UCA <- DNAStringSet(paste0(UCA_V,UCA_J))
names(UCA) <- paste0(mab_name,"_UCA")
# select sequences part of the same clonotype
group <- clonal_tree_data %>% filter(stringr::str_detect(sc_clone_grp, mab)) %>% select(grp) %>% unique() %>% pull()
filtered <- clonal_tree_data %>%
filter(grp == group)
# save csv with b cell lineage lineages info
if(any(stringr::str_count(names(mab), "LOR") > 1)){
write.csv(filtered, paste0("../",result.dir,mab_name[1],".csv"),
row.names = FALSE)
}else{
write.csv(filtered, paste0("../",result.dir,mab_name,".csv"),
row.names = FALSE)
}
# deduplicating sequences
fasta <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
fasta_lc <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
#adding single cell sequences
sc_seqs <- filtered[!duplicated(filtered[,c('sc_clone_grp','VDJ_nt')]),]
#saving duplicated
if(any(stringr::str_count(names(mab), "LOR") > 1)){
sc_seq_count[[mab_name[1]]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}else{
sc_seq_count[[mab_name]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}
sc_seqs <- df_to_fasta(sequence_strings = sc_seqs$VDJ_nt, sequence_name = gsub(":", "_",sc_seqs$new_name), save_fasta = FALSE)
fasta <- c(fasta, sc_seqs, UCA)
fasta <- fasta[unique(names(fasta))]
if(any(stringr::str_count(names(mab), "LOR") > 1)){
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name[[1]], ".fasta"), append = FALSE, format = "fasta")
}else{
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, ".fasta"), append = FALSE, format = "fasta") }
}
## [1] 1
## [1] 2
for(i in seq_along(mabs_of_interests)){
print(i)
mab <- mabs_of_interests[i]
if(any(stringr::str_count(names(mab), "LOR") <= 1)){
mab <- mabs_of_interests[i]
mab_name <- names(mab)
}
if(any(stringr::str_count(names(mab), "LOR") > 1)){
mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
mab_name <- names(mab)}
# create UCA light chain
UCA_LV <- LV_genes[clono_light_chains$v_call[clono_light_chains$name %in% mab]]
UCA_LJ <- LJ_genes[clono_light_chains$j_call[clono_light_chains$name %in% mab]]
UCA_LC <- DNAStringSet(paste0(UCA_LV,UCA_LJ))
names(UCA_LC) <- paste0(mab_name,"_LC_UCA")
# select light chain clonotypes
group <- clono_light_chains %>% filter(stringr::str_detect(name, mab)) %>% select(grp) %>% unique() %>% pull()
filtered <- clono_light_chains %>%
filter(grp == group, substr(name,1,3) == substr(mab,1,3))
# save csv with b cell lineage lineages info
if(any(stringr::str_count(names(mab), "LOR") > 1)){
write.csv(filtered, paste0("../",result.dir, mab_name[1],"_LC", ".csv"),
row.names = FALSE)
}else{
write.csv(filtered, paste0("../",result.dir,mab_name,"_LC",".csv"),
row.names = FALSE)
}
# save object as BStringset
fasta <- df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE)
# save duplicated values
if(any(stringr::str_count(names(mab), "LOR") > 1)){
sc_seq_count[[paste0(mab_name[1],"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}else{
sc_seq_count[[paste0(mab_name,"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}
fasta <- c(fasta, UCA_LC)
fasta <- fasta[names(fasta)]
if(any(stringr::str_count(names(mab), "LOR") > 1)){
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name[[1]],"_LC", ".fasta"), append = FALSE, format = "fasta")
}else{
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, "_LC", ".fasta"), append = FALSE, format = "fasta") }
}
## [1] 1
## [1] 2
Do a Multiple Sequence Alignment using MUSCLE (v 5.1)
for all the fasta files generated through out this analysis. Following
that, run FastTree (v 2.1.11) for all the aligned sequences
and save tree output to be plotted on the following code.
source ~/.bash_profile
DIR_DATE=$(date +'%Y-%m-%d')
cd ../results/$DIR_DATE/
for f in *.fasta; do muscle -align $f -output $f.aln; done
for f in *.aln; do fasttree -nt -gtr $f > $f.tre; done
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 217 seqs, avg length 364, max 370
##
## 00:00 4.6Mb CPU has 8 cores, running 8 threads
## 00:00 4.8Mb 0.0043% Calc posteriors
00:01 47Mb 0.038% Calc posteriors
00:02 172Mb 0.66% Calc posteriors
00:03 192Mb 1.3% Calc posteriors
00:04 210Mb 2.0% Calc posteriors
00:05 227Mb 2.6% Calc posteriors
00:06 233Mb 3.3% Calc posteriors
00:07 237Mb 3.9% Calc posteriors
00:08 241Mb 4.6% Calc posteriors
00:09 256Mb 5.2% Calc posteriors
00:10 257Mb 5.8% Calc posteriors
00:11 271Mb 6.3% Calc posteriors
00:12 292Mb 6.9% Calc posteriors
00:13 336Mb 7.4% Calc posteriors
00:14 373Mb 7.9% Calc posteriors
00:15 392Mb 8.5% Calc posteriors
00:16 446Mb 9.0% Calc posteriors
00:17 457Mb 9.6% Calc posteriors
00:18 505Mb 10.3% Calc posteriors
00:19 529Mb 10.8% Calc posteriors
00:20 582Mb 11.4% Calc posteriors
00:21 593Mb 12.0% Calc posteriors
00:22 624Mb 12.6% Calc posteriors
00:23 658Mb 13.1% Calc posteriors
00:24 674Mb 13.7% Calc posteriors
00:25 701Mb 14.3% Calc posteriors
00:26 712Mb 14.8% Calc posteriors
00:27 716Mb 15.4% Calc posteriors
00:28 731Mb 16.0% Calc posteriors
00:29 747Mb 16.5% Calc posteriors
00:30 759Mb 17.2% Calc posteriors
00:31 771Mb 17.8% Calc posteriors
00:32 780Mb 18.4% Calc posteriors
00:33 790Mb 19.0% Calc posteriors
00:34 791Mb 19.6% Calc posteriors
00:35 808Mb 20.3% Calc posteriors
00:36 815Mb 20.8% Calc posteriors
00:37 830Mb 21.5% Calc posteriors
00:38 860Mb 22.0% Calc posteriors
00:39 870Mb 22.6% Calc posteriors
00:40 767Mb 23.2% Calc posteriors
00:41 790Mb 23.7% Calc posteriors
00:42 798Mb 24.2% Calc posteriors
00:43 800Mb 24.8% Calc posteriors
00:44 814Mb 25.4% Calc posteriors
00:45 816Mb 26.1% Calc posteriors
00:46 817Mb 26.7% Calc posteriors
00:47 820Mb 27.2% Calc posteriors
00:48 833Mb 27.8% Calc posteriors
00:49 850Mb 28.4% Calc posteriors
00:50 862Mb 28.9% Calc posteriors
00:51 876Mb 29.4% Calc posteriors
00:52 886Mb 30.0% Calc posteriors
00:53 776Mb 30.5% Calc posteriors
00:54 797Mb 31.0% Calc posteriors
00:55 804Mb 31.6% Calc posteriors
00:56 700Mb 32.1% Calc posteriors
00:57 701Mb 32.6% Calc posteriors
00:58 710Mb 33.1% Calc posteriors
00:59 728Mb 33.6% Calc posteriors
01:00 743Mb 34.2% Calc posteriors
01:01 767Mb 34.7% Calc posteriors
01:02 777Mb 35.3% Calc posteriors
01:03 789Mb 35.8% Calc posteriors
01:04 801Mb 36.3% Calc posteriors
01:05 806Mb 36.9% Calc posteriors
01:06 828Mb 37.5% Calc posteriors
01:07 848Mb 38.1% Calc posteriors
01:08 887Mb 38.7% Calc posteriors
01:09 755Mb 39.3% Calc posteriors
01:10 769Mb 39.8% Calc posteriors
01:11 776Mb 40.4% Calc posteriors
01:12 785Mb 41.0% Calc posteriors
01:13 799Mb 41.5% Calc posteriors
01:14 819Mb 42.1% Calc posteriors
01:15 831Mb 42.6% Calc posteriors
01:16 849Mb 43.2% Calc posteriors
01:17 861Mb 43.9% Calc posteriors
01:18 878Mb 44.5% Calc posteriors
01:19 899Mb 45.1% Calc posteriors
01:20 917Mb 45.7% Calc posteriors
01:21 945Mb 46.3% Calc posteriors
01:22 820Mb 46.8% Calc posteriors
01:23 848Mb 47.4% Calc posteriors
01:24 849Mb 47.9% Calc posteriors
01:25 850Mb 48.5% Calc posteriors
01:26 859Mb 49.0% Calc posteriors
01:27 863Mb 49.6% Calc posteriors
01:28 744Mb 50.2% Calc posteriors
01:29 753Mb 50.8% Calc posteriors
01:30 762Mb 51.4% Calc posteriors
01:31 771Mb 51.9% Calc posteriors
01:32 799Mb 52.5% Calc posteriors
01:33 801Mb 53.1% Calc posteriors
01:34 802Mb 53.7% Calc posteriors
01:35 811Mb 54.3% Calc posteriors
01:36 820Mb 54.8% Calc posteriors
01:37 821Mb 55.5% Calc posteriors
01:38 823Mb 56.1% Calc posteriors
01:39 834Mb 56.7% Calc posteriors
01:40 841Mb 57.3% Calc posteriors
01:41 862Mb 57.8% Calc posteriors
01:42 866Mb 58.3% Calc posteriors
01:43 879Mb 58.9% Calc posteriors
01:44 884Mb 59.4% Calc posteriors
01:45 891Mb 59.9% Calc posteriors
01:46 903Mb 60.4% Calc posteriors
01:47 925Mb 60.9% Calc posteriors
01:48 943Mb 61.5% Calc posteriors
01:49 970Mb 61.9% Calc posteriors
01:50 981Mb 62.4% Calc posteriors
01:51 982Mb 63.0% Calc posteriors
01:52 986Mb 63.5% Calc posteriors
01:53 989Mb 64.0% Calc posteriors
01:54 1.0Gb 64.5% Calc posteriors
01:55 1.0Gb 65.0% Calc posteriors
01:56 901Mb 65.5% Calc posteriors
01:57 902Mb 66.0% Calc posteriors
01:58 904Mb 66.5% Calc posteriors
01:59 910Mb 67.1% Calc posteriors
02:00 919Mb 67.6% Calc posteriors
02:01 923Mb 68.2% Calc posteriors
02:02 924Mb 68.7% Calc posteriors
02:03 930Mb 69.1% Calc posteriors
02:04 929Mb 69.7% Calc posteriors
02:05 932Mb 70.2% Calc posteriors
02:06 887Mb 70.7% Calc posteriors
02:07 929Mb 71.2% Calc posteriors
02:08 927Mb 71.7% Calc posteriors
02:09 932Mb 72.3% Calc posteriors
02:10 935Mb 72.9% Calc posteriors
02:11 931Mb 73.4% Calc posteriors
02:12 814Mb 73.9% Calc posteriors
02:13 817Mb 74.5% Calc posteriors
02:14 827Mb 75.1% Calc posteriors
02:15 850Mb 75.7% Calc posteriors
02:16 854Mb 76.3% Calc posteriors
02:17 862Mb 76.9% Calc posteriors
02:18 873Mb 77.5% Calc posteriors
02:19 882Mb 78.1% Calc posteriors
02:20 886Mb 78.7% Calc posteriors
02:21 898Mb 79.2% Calc posteriors
02:22 916Mb 79.7% Calc posteriors
02:23 923Mb 80.3% Calc posteriors
02:24 924Mb 80.8% Calc posteriors
02:25 929Mb 81.3% Calc posteriors
02:26 929Mb 81.9% Calc posteriors
02:27 915Mb 82.5% Calc posteriors
02:28 894Mb 83.0% Calc posteriors
02:29 863Mb 83.6% Calc posteriors
02:30 816Mb 84.0% Calc posteriors
02:31 812Mb 84.6% Calc posteriors
02:32 800Mb 85.1% Calc posteriors
02:33 796Mb 85.7% Calc posteriors
02:34 788Mb 86.3% Calc posteriors
02:35 795Mb 86.8% Calc posteriors
02:36 795Mb 87.3% Calc posteriors
02:37 802Mb 87.8% Calc posteriors
02:38 823Mb 88.2% Calc posteriors
02:39 826Mb 88.8% Calc posteriors
02:40 829Mb 89.4% Calc posteriors
02:41 845Mb 90.0% Calc posteriors
02:42 849Mb 90.6% Calc posteriors
02:43 850Mb 91.1% Calc posteriors
02:44 854Mb 91.7% Calc posteriors
02:45 866Mb 92.3% Calc posteriors
02:46 869Mb 92.9% Calc posteriors
02:47 878Mb 93.4% Calc posteriors
02:48 896Mb 94.0% Calc posteriors
02:49 897Mb 94.6% Calc posteriors
02:50 903Mb 95.1% Calc posteriors
02:51 904Mb 95.7% Calc posteriors
02:52 905Mb 96.3% Calc posteriors
02:53 906Mb 96.9% Calc posteriors
02:54 797Mb 97.5% Calc posteriors
02:55 799Mb 98.1% Calc posteriors
02:56 800Mb 98.7% Calc posteriors
02:57 807Mb 99.4% Calc posteriors
02:58 804Mb 99.9% Calc posteriors
02:58 799Mb 100.0% Calc posteriors
## 02:58 799Mb 0.0043% Consistency (1/2)
02:59 834Mb 4.5% Consistency (1/2)
03:00 843Mb 10.3% Consistency (1/2)
03:01 854Mb 16.3% Consistency (1/2)
03:02 866Mb 23.4% Consistency (1/2)
03:03 880Mb 31.5% Consistency (1/2)
03:04 894Mb 39.8% Consistency (1/2)
03:05 908Mb 48.2% Consistency (1/2)
03:06 923Mb 57.0% Consistency (1/2)
03:07 935Mb 63.7% Consistency (1/2)
03:08 944Mb 69.2% Consistency (1/2)
03:09 954Mb 75.3% Consistency (1/2)
03:10 966Mb 82.1% Consistency (1/2)
03:11 976Mb 88.2% Consistency (1/2)
03:12 988Mb 95.0% Consistency (1/2)
03:12 996Mb 100.0% Consistency (1/2)
## 03:12 996Mb 0.0043% Consistency (2/2)
03:13 996Mb 0.61% Consistency (2/2)
03:14 996Mb 5.9% Consistency (2/2)
03:15 996Mb 11.4% Consistency (2/2)
03:16 996Mb 17.8% Consistency (2/2)
03:17 996Mb 25.2% Consistency (2/2)
03:18 996Mb 32.1% Consistency (2/2)
03:19 996Mb 39.6% Consistency (2/2)
03:20 996Mb 46.8% Consistency (2/2)
03:21 996Mb 53.6% Consistency (2/2)
03:22 996Mb 60.3% Consistency (2/2)
03:23 996Mb 67.0% Consistency (2/2)
03:24 996Mb 74.6% Consistency (2/2)
03:25 996Mb 82.2% Consistency (2/2)
03:26 996Mb 90.1% Consistency (2/2)
03:27 996Mb 96.4% Consistency (2/2)
03:27 996Mb 100.0% Consistency (2/2)
## 03:27 997Mb 0.46% UPGMA5
03:27 997Mb 100.0% UPGMA5
## 03:28 998Mb 1.0% Refining
03:29 997Mb 17.0% Refining
03:30 998Mb 35.0% Refining
03:31 1000Mb 55.0% Refining
03:32 1.0Gb 74.0% Refining
03:33 1.0Gb 94.0% Refining
03:33 999Mb 100.0% Refining
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 8 seqs, avg length 322, max 326
##
## 00:00 2.1Mb CPU has 8 cores, running 8 threads
## 00:00 2.4Mb 3.6% Calc posteriors
00:00 84Mb 100.0% Calc posteriors
## 00:00 88Mb 3.6% Consistency (1/2)
00:00 88Mb 100.0% Consistency (1/2)
## 00:00 88Mb 3.6% Consistency (2/2)
00:00 88Mb 100.0% Consistency (2/2)
## 00:00 88Mb 14.3% UPGMA5
00:00 88Mb 100.0% UPGMA5
## 00:00 88Mb 1.0% Refining
00:00 89Mb 100.0% Refining
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 118 seqs, avg length 370, max 370
##
## 00:00 2.8Mb CPU has 8 cores, running 8 threads
## 00:00 3.1Mb 0.014% Calc posteriors
00:01 46Mb 0.13% Calc posteriors
00:02 177Mb 2.3% Calc posteriors
00:03 214Mb 4.5% Calc posteriors
00:04 237Mb 6.5% Calc posteriors
00:05 264Mb 8.5% Calc posteriors
00:06 271Mb 10.5% Calc posteriors
00:07 280Mb 12.6% Calc posteriors
00:08 299Mb 14.5% Calc posteriors
00:09 304Mb 16.3% Calc posteriors
00:10 319Mb 18.0% Calc posteriors
00:11 320Mb 19.8% Calc posteriors
00:12 347Mb 21.6% Calc posteriors
00:13 372Mb 23.4% Calc posteriors
00:14 379Mb 25.2% Calc posteriors
00:15 386Mb 27.1% Calc posteriors
00:16 419Mb 28.8% Calc posteriors
00:17 426Mb 30.6% Calc posteriors
00:18 443Mb 32.2% Calc posteriors
00:19 465Mb 33.7% Calc posteriors
00:20 489Mb 35.2% Calc posteriors
00:21 493Mb 36.8% Calc posteriors
00:22 500Mb 38.4% Calc posteriors
00:23 522Mb 40.4% Calc posteriors
00:24 538Mb 42.3% Calc posteriors
00:25 547Mb 44.2% Calc posteriors
00:26 567Mb 45.8% Calc posteriors
00:27 574Mb 47.9% Calc posteriors
00:28 595Mb 49.8% Calc posteriors
00:29 629Mb 51.7% Calc posteriors
00:30 655Mb 53.6% Calc posteriors
00:31 662Mb 55.7% Calc posteriors
00:32 688Mb 57.5% Calc posteriors
00:33 697Mb 59.5% Calc posteriors
00:34 703Mb 61.4% Calc posteriors
00:35 716Mb 63.4% Calc posteriors
00:36 747Mb 65.2% Calc posteriors
00:37 796Mb 67.1% Calc posteriors
00:38 830Mb 68.9% Calc posteriors
00:39 834Mb 70.6% Calc posteriors
00:40 841Mb 72.1% Calc posteriors
00:41 847Mb 73.8% Calc posteriors
00:42 857Mb 75.4% Calc posteriors
00:43 861Mb 77.1% Calc posteriors
00:44 865Mb 78.8% Calc posteriors
00:45 878Mb 80.6% Calc posteriors
00:46 884Mb 82.3% Calc posteriors
00:47 891Mb 83.9% Calc posteriors
00:48 900Mb 85.0% Calc posteriors
00:49 909Mb 86.9% Calc posteriors
00:50 924Mb 88.6% Calc posteriors
00:51 930Mb 90.4% Calc posteriors
00:52 940Mb 92.1% Calc posteriors
00:53 949Mb 93.8% Calc posteriors
00:54 966Mb 95.5% Calc posteriors
00:55 976Mb 97.3% Calc posteriors
00:56 988Mb 99.0% Calc posteriors
00:56 988Mb 100.0% Calc posteriors
## 00:56 988Mb 0.014% Consistency (1/2)
00:57 993Mb 10.0% Consistency (1/2)
00:58 1.0Gb 56.0% Consistency (1/2)
00:59 1.0Gb 89.1% Consistency (1/2)
00:59 1.0Gb 100.0% Consistency (1/2)
## 00:59 1.0Gb 0.014% Consistency (2/2)
01:00 1.0Gb 35.0% Consistency (2/2)
01:01 1.0Gb 83.7% Consistency (2/2)
01:01 1.0Gb 100.0% Consistency (2/2)
## 01:01 1.0Gb 0.85% UPGMA5
01:01 1.0Gb 100.0% UPGMA5
## 01:01 1.0Gb 1.0% Refining
01:02 1.0Gb 37.0% Refining
01:03 1.1Gb 99.0% Refining
01:03 1.1Gb 100.0% Refining
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 10 seqs, avg length 331, max 334
##
## 00:00 2.1Mb CPU has 8 cores, running 8 threads
## 00:00 2.4Mb 2.2% Calc posteriors
00:00 103Mb 100.0% Calc posteriors
## 00:00 103Mb 2.2% Consistency (1/2)
00:00 104Mb 100.0% Consistency (1/2)
## 00:00 104Mb 2.2% Consistency (2/2)
00:00 104Mb 100.0% Consistency (2/2)
## 00:00 104Mb 11.1% UPGMA5
00:00 104Mb 100.0% UPGMA5
## 00:00 104Mb 1.0% Refining
00:00 104Mb 100.0% Refining
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.09 seconds
## Refining topology: 31 rounds ME-NNIs, 2 rounds ME-SPRs, 16 rounds ML-NNIs
## 0.10 seconds: ME NNI round 2 of 31, 101 of 215 splits, 23 changes (max delta 0.002)
## 0.23 seconds: SPR round 1 of 2, 201 of 432 nodes
## 0.35 seconds: SPR round 1 of 2, 401 of 432 nodes
## 0.51 seconds: SPR round 2 of 2, 201 of 432 nodes
## 0.64 seconds: SPR round 2 of 2, 401 of 432 nodes
## Total branch-length 2.197 after 0.67 sec
## 0.74 seconds: ML NNI round 1 of 16, 1 of 215 splits
## 0.92 seconds: ML NNI round 1 of 16, 201 of 215 splits, 52 changes (max delta 7.090)
## ML-NNI round 1: LogLk = -5720.973 NNIs 57 max delta 7.09 Time 0.94
## 1.07 seconds: Optimizing GTR model, step 3 of 12
## 1.18 seconds: Optimizing GTR model, step 5 of 12
## 1.29 seconds: Optimizing GTR model, step 7 of 12
## 1.44 seconds: Optimizing GTR model, step 10 of 12
## GTR Frequencies: 0.2176 0.2444 0.3116 0.2264
## GTR rates(ac ag at cg ct gt) 1.2448 1.4803 0.7188 0.5852 1.4823 1.0000
## 1.56 seconds: ML Lengths 1 of 215 splits
## 1.67 seconds: Site likelihoods with rate category 5 of 20
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.816 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## 1.89 seconds: ML NNI round 2 of 16, 101 of 215 splits, 38 changes (max delta 0.001)
## 1.99 seconds: ML NNI round 2 of 16, 201 of 215 splits, 63 changes (max delta 0.001)
## ML-NNI round 2: LogLk = -5299.576 NNIs 67 max delta 0.00 Time 2.02
## Turning off heuristics for final round of ML NNIs (converged)
## 2.17 seconds: ML NNI round 3 of 16, 101 of 215 splits, 40 changes (max delta 0.000)
## 2.31 seconds: ML NNI round 3 of 16, 201 of 215 splits, 57 changes (max delta 1.019)
## ML-NNI round 3: LogLk = -5298.548 NNIs 61 max delta 1.02 Time 2.34 (final)
## Optimize all lengths: LogLk = -5298.540 Time 2.41
## 2.57 seconds: ML split tests for 100 of 214 internal splits
## 2.74 seconds: ML split tests for 200 of 214 internal splits
## Total time: 2.76 seconds Unique: 217/217 Bad splits: 0/214
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 10 rounds ME-NNIs, 2 rounds ME-SPRs, 5 rounds ML-NNIs
## Total branch-length 0.080 after 0.00 sec
## ML-NNI round 1: LogLk = -612.010 NNIs 1 max delta 0.13 Time 0.00
## GTR Frequencies: 0.2670 0.2510 0.2459 0.2361
## GTR rates(ac ag at cg ct gt) 0.1273 14.1848 2.8987 3.9092 3.0704 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.642 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -587.982 NNIs 0 max delta 0.00 Time 0.02
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -587.981 NNIs 0 max delta 0.00 Time 0.02 (final)
## Optimize all lengths: LogLk = -587.981 Time 0.03
## Total time: 0.03 seconds Unique: 6/8 Bad splits: 0/3
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.03 seconds
## Refining topology: 28 rounds ME-NNIs, 2 rounds ME-SPRs, 14 rounds ML-NNIs
## 0.13 seconds: SPR round 1 of 2, 201 of 234 nodes
## 0.24 seconds: SPR round 2 of 2, 201 of 234 nodes
## Total branch-length 1.596 after 0.28 sec
## 0.38 seconds: ML NNI round 1 of 14, 101 of 116 splits, 29 changes (max delta 5.999)
## ML-NNI round 1: LogLk = -4057.963 NNIs 35 max delta 6.00 Time 0.40
## 0.51 seconds: Optimizing GTR model, step 5 of 12
## 0.63 seconds: Optimizing GTR model, step 10 of 12
## GTR Frequencies: 0.2038 0.2859 0.3002 0.2101
## GTR rates(ac ag at cg ct gt) 1.0761 0.9774 0.5841 0.4792 1.2071 1.0000
## 0.74 seconds: ML Lengths 101 of 116 splits
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.789 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## 0.90 seconds: ML NNI round 2 of 14, 101 of 116 splits, 15 changes (max delta 0.290)
## ML-NNI round 2: LogLk = -3817.538 NNIs 18 max delta 1.05 Time 0.93
## ML-NNI round 3: LogLk = -3814.285 NNIs 1 max delta 3.25 Time 1.00
## ML-NNI round 4: LogLk = -3814.285 NNIs 0 max delta 0.00 Time 1.06
## Turning off heuristics for final round of ML NNIs (converged)
## 1.05 seconds: ML NNI round 5 of 14, 1 of 116 splits
## 1.17 seconds: ML NNI round 5 of 14, 101 of 116 splits, 0 changes
## ML-NNI round 5: LogLk = -3814.284 NNIs 0 max delta 0.00 Time 1.20 (final)
## Optimize all lengths: LogLk = -3814.284 Time 1.24
## 1.42 seconds: ML split tests for 100 of 115 internal splits
## Total time: 1.45 seconds Unique: 118/118 Bad splits: 0/115
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 13 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
## Total branch-length 0.104 after 0.01 sec
## ML-NNI round 1: LogLk = -699.140 NNIs 1 max delta 0.00 Time 0.01
## GTR Frequencies: 0.2026 0.3061 0.2592 0.2321
## GTR rates(ac ag at cg ct gt) 0.7128 4.0992 1.9319 0.9442 0.6308 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.648 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -675.314 NNIs 1 max delta 0.00 Time 0.05
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -675.314 NNIs 0 max delta 0.00 Time 0.06 (final)
## Optimize all lengths: LogLk = -675.314 Time 0.07
## Total time: 0.08 seconds Unique: 9/10 Bad splits: 0/6
Generated trees arre edited using treeio and plotted
with ggtree. The trees were rerooted to their respectives
Unmutated Common Ancestor (UCA), which in this case is defined as just
the V and J gene germlines combined. The gap between V-J is inserted
automatically by the alignment method, thus the CDR3 here is not
considered for the UCA.
# function to root tree in UCA
.to_root_uca <- function(x){
root(x,which(grepl("UCA",x[["tip.label"]])))
}
ls <- list.files(paste0("../",result.dir), pattern = "*\\.tre", full.names = TRUE)
names(ls) <- lapply(ls, function(x) {
if (stringr::str_count(x, "LOR") > 1) {
substr(x, 24, 34)
}else if (grepl("LC",x)) {
substr(x, 24, 31)
}else if (grepl("LOR",x)) {
substr(x, 24, 28)
}else{
substr(x, 24,33)
}})
trees <- lapply(ls, read.tree)
trees_rerooted <- lapply(trees, .to_root_uca)
plots <- lapply(trees_rerooted, ggtree)
for(i in seq_along(plots)){
print(i)
plot_name <- names(plots)[i]
plots_edit <- plots[[i]]$data %>%
mutate(new_label = ifelse(grepl("sc_|LOR",label), gsub("sc_", "",label), ""),
new_label = plyr::mapvalues(new_label,from = lor_mabs$well_ID, to = lor_mabs$LOR,
warn_missing = FALSE),
new_label = ifelse(grepl("LOR",new_label), new_label, ""),
name = label,
timepoint = case_when(grepl("B2-igg",name) ~ "B2",
grepl("B1-igg",name) ~ "B1",
grepl("sc_|LOR",name) ~ "single_cell",
grepl("igm",name) ~ "PV",
TRUE ~ "intersects"),
name = gsub("sc_", "", label))
if(stringr::str_count(plot_name, "LOR") > 1){
plots_edit <- left_join(plots_edit, sc_seq_count[[paste0(plot_name,1)]], by = "name")
}else{
plots_edit <- left_join(plots_edit, sc_seq_count[[plot_name]], by = "name")
}
# shapes_timepoints <- c("PV" = 18, "B1" = 17, "B2" = 16, "single_cell" = 4)
colors_timepoints <- c("PV" = "red", "intersects" = "black", "B1" = "#66c2a5", "B2" = "#fc8d62", "single_cell" = "#fc8d62")
gg_plot <- ggtree(plots_edit,aes(color = timepoint)) +
{if(grepl("LC", plot_name))geom_tippoint(shape = 18, size = 1)
else geom_tippoint(aes(size = duplicated), shape = 18)}+
# geom_tippoint(aes(size = duplicated), shape = 23) +
geom_tiplab(aes(label = new_label), hjust = -.2) +
labs(size = "Count", shape = "Shape", color = "Timepoint") +
scale_color_manual(values = colors_timepoints) +
coord_cartesian(clip = 'off') +
# scale_shape_manual(values = shapes_timepoints) +
scale_size_area(limits = c(1,25), breaks = c(1,5,10,15,25), max_size = 3)+
geom_treescale(width = .05)
print(gg_plot)
ggsave(gg_plot, filename = paste0("../", result.dir, names(ls)[[i]], "_tree.pdf"), width = ifelse(grepl("LC", plot_name), 3, 4), height = ifelse(grepl("LC", plot_name), 3, 6))
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
Query LOR24 amino acid sequence in bulk repertoire sequencing from each animal. Query was done without alignment due to large number of sequences, but string distance was calculated using Levenstein distance.
The entire dataset was loaded, filtered and used to calculate
distances to LOR24. The output was generated and saved in
.rds format to avoid recaculation for each run.
processed_data <- readRDS("../data/clonotypes/individualized/processed_clonotypes.rds")
processed_data <- processed_data %>% select(grp, name, ID, timepoint, new_name, name, V_SHM, VDJ_aa)
LOR24aa <- "QLQLQESGPGLVKSSETLPLTCAVSGDSISSSYWSWIRQAPGKGLEWIGYIYGSGSYSHYNPSLKSRVTLSVDTSKNQFFLRLNSVTVADTAVYYCARGGRGNTYSWNRFDVWGPGVLVTVSS"
# calculate string distance between LOR24 and all the sequences
identity_matrix <- stringdist::stringdist(gsub("\\*", "",processed_data$VDJ_aa), LOR24aa, method = "lv", nthread = 8)
# calculate the percentage of identity, normalize by LOR24 length
processed_data <- processed_data %>%
mutate(identity = (1-identity_matrix/nchar(LOR24aa)) * 100) %>%
select(V_SHM, identity, ID)
saveRDS(processed_data, "../data/clonotypes/individualized/lor24_identity_calculation.rds")
The processed dataset above was used as an input to plot a 2d-histogram. This plot includes the somatic hypermutation percentage vs the identity to LOR24 amino acid sequences. In essence, this plot shows how a set of sequences across different non-human primates were evolving to become more identical to LOR24.
processed_data <- readRDS("../data/clonotypes/individualized/lor24_identity_calculation.rds")
# plot a 2d-histogram of mutations and identity to LOR24
processed_data %>%
filter(identity >= 70) %>%
ggplot(aes(x = V_SHM, y = identity)) +
geom_bin2d(bins = 30) +
scale_fill_viridis_c(trans = "log10") +
theme_prism() +
# scale_y_continuous(expand=c(0,1),limits=c(0,101)) +
labs(x = "% Divergence from germline", y = paste0("% Identity to LOR24aa"),
fill = "Sequence counts\n(log10)") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12)) +
facet_wrap(~ID)
ggsave(filename = paste0("../", result.dir,"LOR24_2d-histogram.pdf"))
LOR mAbs had a diverse competition profiles against different standardized mAbs while binding to PostF and PreF RSV fusion proteins.
data_comp_auc$EC50_preF <- log10(data_comp_auc$EC50_preF)
data_comp_auc$EC50_postF <- log10(data_comp_auc$EC50_postF)
data_comp_auc$IC50_RSV_A2 <- log10(data_comp_auc$IC50_RSV_A2)
# change here the columns you want to remove from the heatmap
to_remove <- c("EC50_postF","EC50_preF", "epitope", "specificity", "IC50_RSV_A2")
annot_colors <- list(specificity = c("PreF" = "#F7D586",
"PreF/PostF" = "#CD87F8",
"PostF" = "#92CDD6",
"w.b." = "grey20"), epitope = c(
"Ø" = "#F3B084",
"Ø/V" = "pink",
"V" = "salmon",
"II/V" = "#F5B350",
"III" = "#A9D08D",
"IV" = "#FAD0FF",
"I/IV" = "#cbc9f3",
"II" = "#FFD966",
"I" = "#9BC1E6",
"foldon" = "black",
"UK-Pre" = "#C9C9C9",
"UK-DP" = "#8497B0" ,
"WB" = "grey95",
"UK-PostF" = "grey40"))
{
g_heatmap_scale <- data_comp_auc %>%
select(-all_of(to_remove)) %>%
mutate(across(.cols = everything(), .fns = function(x) pmax(x,0))) %>%
t() %>%
pheatmap::pheatmap(scale = "none", angle_col = 90, cutree_cols = 8,
clustering_method = "ward.D",
color = viridisLite::viridis(100),
cluster_rows = FALSE, border_color = "grey40",
cluster_cols = TRUE,
legend_breaks = c(0, 0.5, 1, 1.5, max(.)),
legend_labels = c("0", "0.5", "1", "1.5", "Normalized\ncompetition\n"),
annotation_col = data_comp_auc[,13:14],
annotation_colors = annot_colors,
cellwidth = 5, cellheight = 5, fontsize_row = 5, fontsize_col = 6, fontsize = 6)
ggsave(g_heatmap_scale,filename = paste0("../", result.dir, "/", "LOR_heatmap_auc_wardD.pdf"), width = 18, height = 12, units = "cm")
}
This MDS is another way to visualize the same data showed on the heatmap above. Here, the MDS input is the euclidean distance. The data was scaled and centered prior calculating their euclidean distance, which is the most used and simplest distance calculation.
# check which mabs should be included and added
to_habillage <- factor(rownames(data_comp_auc), levels = c(paste0("LOR", sprintf(fmt = "%02d",seq(1:106)))))
data_comp_auc <- data_comp_auc %>% tibble::rownames_to_column("name")
to_remove <- c("epitope", "specificity", "IC50_RSV_A2")
# compute MDS
mds_scaled <- data_comp_auc %>%
select(-all_of(c(to_remove, "name"))) %>%
scale(center = TRUE, scale = TRUE) %>%
dist(method = "euclidean") %>%
cmdscale() %>%
as_tibble()
colnames(mds_scaled) <- c("Dim.1", "Dim.2")
# Plot MDS
mds_scaled$name <- to_habillage
p1 <- mds_scaled %>%
left_join(data_comp_auc) %>%
ggplot(aes(Dim.1,Dim.2, label = name, color = epitope)) +
geom_point(size = 2) +
ggrepel::geom_text_repel(max.overlaps = 5) +
scale_color_manual(values = annot_colors[["epitope"]]) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5),
aspect.ratio = 1)
plotly::ggplotly(p1)
ggsave(plot = p1, width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-sites.pdf"))
p2 <- mds_scaled %>%
left_join(data_comp_auc) %>%
ggplot(aes(Dim.1,Dim.2, label = name, color = IC50_RSV_A2)) +
geom_point(size = 2) +
ggrepel::geom_text_repel(max.overlaps = 5) +
scale_color_viridis_c() +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
labs(color = "IC50 RSV\n(log10)")+
theme(axis.ticks = element_line(size = .5),
aspect.ratio = 1)
plotly::ggplotly(p2)
ggsave(plot = p2, width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-neuts.pdf"))
mabs_lors <- mabs_lors %>%
filter(name %in% lor_mabs$well_ID) %>%
mutate(mAbs_ID = plyr::mapvalues(name, from = lor_mabs$well_ID, to = lor_mabs$LOR)) %>%
arrange(mAbs_ID) %>%
relocate(mAbs_ID)
mabs_clonal_rank <- rds_summary
for(i in mabs_lors$name) {
mabs_clonal_rank[[i]] <- ifelse(grepl(i, rds_summary$sc_clone_grp),
mabs_lors[mabs_lors$name == i,]$mAbs_ID,
NA)
}
col_combine <- colnames(mabs_clonal_rank[15:length(mabs_clonal_rank)])
mabs_clonal_rank <- mabs_clonal_rank %>%
mutate(LOR = purrr::invoke(coalesce, across(all_of(col_combine)))) %>%
select(LOR, colnames(mabs_clonal_rank)[! colnames(mabs_clonal_rank) %in% col_combine]) %>%
filter(!is.na(LOR), database == "Individualized", Timepoint == "B2") %>%
arrange(LOR) %>%
mutate(well_ID = plyr::mapvalues(LOR, from = lor_mabs$LOR, to = substr(lor_mabs$well_ID, 1,3))) %>%
filter(ID == well_ID)
mabs_lors <- mabs_lors %>%
mutate(clonal_rank_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size_rank),
clonal_size_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size),
clonal_group = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$grp)) %>%
relocate(mAbs_ID, clonal_rank_B2, clonal_size_B2, clonal_group)
# Fixed manually LOR24 (same as LOR19), LOR37 (not detected B2, clonal size sc = 1), LOR40 (not detected B2, clonal size sc = 1 ), LOR42 (same clone as LOR01), LOR94 (same clone as LOR01), LOR96 (not detected B2, clonal size sc = 4)
#mabs_lors %>% write.csv(paste0("../data/single_cell/LOR_mAbs_info.csv"), row.names = F)
mabs_lors <- read.csv(file = paste0("../data/single_cell/LOR_mAbs_info.csv"), sep = ",")
mabs_lors %>%
left_join(data_comp_auc %>% tibble::rowid_to_column("mAbs_ID") %>% mutate(mAbs_ID = as.character(name)), by = "mAbs_ID") %>%
ggplot(aes(x = clonal_rank_B2, y = clonal_size_B2, color = epitope)) +
geom_point(size = 3) +
scale_color_manual(values = annot_colors[["epitope"]]) +
scale_y_log10() +
scale_x_log10() +
labs(fill = "Animal ID", x = "Clonal rank\n(log10)", y = "Clonal size\n(log10)") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5))
merged_mabs_lors <- mabs_lors %>%
left_join(clonotype_rsv, by = "name") %>%
left_join(clono_light_chains,suffix = c("_HC","_LC"), by = "name") %>%
mutate(v_call_HC = V_gene.x) %>%
fill(v_call_LC)
write.csv(merged_mabs_lors, file = "../data/single_cell/LOR_mAbs_info_full.csv", row.names = F)
merged_mabs_lors %>%
group_by(v_call_HC, v_call_LC) %>%
mutate(mabs_counts = n()) %>%
ggplot(aes(x= v_call_HC, y = v_call_LC, fill = mabs_counts)) +
geom_tile(color = "black") +
scale_fill_viridis_c(option = "viridis", direction = 1) +
scale_y_discrete(position = "right") +
labs(fill = "mAb counts", x = "IGHV alleles", y = "IGHJ alleles") +
coord_equal() +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.text.x = element_text(angle = 90, size = 6, hjust = 1,
vjust = 0.5, face = "bold", colour = "black"),
axis.text.y = element_text(face = "bold", colour = "black", size = 6),
strip.text = element_text(face = "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "right",
axis.ticks = element_line(size = .5),
legend.title = element_text())
renv::snapshot()
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS/LAPACK: /Users/rodrigoarcoverde/opt/anaconda3/envs/rsv_np_repertoire/lib/libopenblasp-r0.3.21.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] kableExtra_1.3.4 dplyr_1.1.0 ggprism_1.0.4
## [4] scales_1.2.1 vegan_2.6-4 lattice_0.20-45
## [7] permute_0.9-7 data.table_1.14.8 Biostrings_2.66.0
## [10] GenomeInfoDb_1.34.8 XVector_0.38.0 IRanges_2.32.0
## [13] S4Vectors_0.36.0 BiocGenerics_0.44.0 ggpubr_0.6.0
## [16] ggplot2_3.4.1 rstatix_0.7.2 ggtree_3.6.0
## [19] treeio_1.22.0 tidyr_1.3.0 jsonlite_1.8.4
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-162 bitops_1.0-7 sf_1.0-7
## [4] RColorBrewer_1.1-3 webshot_0.5.4 httr_1.4.5
## [7] tools_4.2.2 backports_1.4.1 bslib_0.4.2
## [10] utf8_1.2.3 R6_2.5.1 KernSmooth_2.23-20
## [13] DBI_1.1.3 lazyeval_0.2.2 mgcv_1.8-42
## [16] colorspace_2.1-0 withr_2.5.0 tidyselect_1.2.0
## [19] compiler_4.2.2 cli_3.6.0 rvest_1.0.3
## [22] xml2_1.3.3 plotly_4.10.1 labeling_0.4.2
## [25] sass_0.4.5 classInt_0.4-9 proxy_0.4-27
## [28] systemfonts_1.0.4 stringr_1.5.0 digest_0.6.31
## [31] yulab.utils_0.0.6 rmarkdown_2.20 svglite_2.1.1
## [34] pkgconfig_2.0.3 htmltools_0.5.4 fastmap_1.1.1
## [37] highr_0.10 htmlwidgets_1.6.1 rlang_1.0.6
## [40] rstudioapi_0.14 ggVennDiagram_1.2.2 gridGraphics_0.5-1
## [43] jquerylib_0.1.4 generics_0.1.3 farver_2.1.1
## [46] crosstalk_1.2.0 car_3.1-1 RCurl_1.98-1.10
## [49] magrittr_2.0.3 ggplotify_0.1.0 GenomeInfoDbData_1.2.9
## [52] patchwork_1.1.2 Matrix_1.5-3 Rcpp_1.0.10
## [55] munsell_0.5.0 fansi_1.0.4 ape_5.7
## [58] abind_1.4-5 lifecycle_1.0.3 stringi_1.7.12
## [61] yaml_2.3.7 carData_3.0-5 MASS_7.3-58.3
## [64] zlibbioc_1.44.0 plyr_1.8.8 grid_4.2.2
## [67] ggrepel_0.9.3 parallel_4.2.2 crayon_1.5.2
## [70] splines_4.2.2 knitr_1.42 pillar_1.8.1
## [73] ggsignif_0.6.4 glue_1.6.2 evaluate_0.20
## [76] ggfun_0.0.9 vctrs_0.5.2 gtable_0.3.1
## [79] purrr_1.0.1 cachem_1.0.7 xfun_0.37
## [82] broom_1.0.4 e1071_1.7-13 tidytree_0.4.2
## [85] class_7.3-21 RVenn_1.1.0 viridisLite_0.4.1
## [88] pheatmap_1.0.12 tibble_3.2.0 aplot_0.1.10
## [91] units_0.8-1 cluster_2.1.4 ellipsis_0.3.2